
Sections 42–46 of the ISLE Architecture. For the full table of contents, see README.md.


Part XI: AI/ML and Accelerators

Unified accelerator framework, heterogeneous memory, isolation, in-kernel inference, and GPU compatibility.


42. Unified Accelerator Framework

This section defines kernel-level AI/ML infrastructure: unified accelerator management, heterogeneous memory, peer-to-peer DMA, multi-tenant isolation, in-kernel inference, and distributed ML networking. These are kernel-internal capabilities — training frameworks and model serving remain in userspace.

Design Constraints:

  1. Drop-in compatibility: Existing Linux GPU/accelerator userspace (CUDA runtime, ROCm, OpenCL, Vulkan compute, /dev/dri, /dev/nvidia*) must work through the compatibility layer. Existing tools must not break.
  2. Superset, not replacement: ISLE exposes additional accelerator management capabilities beyond what Linux provides. Userspace software that wants better scheduling, isolation, or memory management can opt in. Software that doesn't care sees standard Linux behavior.
  3. IP-clean: Built from public specifications (PCIe, VFIO, DRM/KMS interfaces), academic algorithms, and open standards. No proprietary API reimplementation.

42.1 Motivation

42.1.1 The Current State

In 2026, accelerators (GPUs, NPUs, TPUs, FPGAs, custom ASICs) are as fundamental to computing as network interfaces. Every cloud instance, every phone, every laptop has at least one. AI/ML workloads are the dominant growth driver in datacenter compute.

Yet the Linux kernel treats accelerators as dumb peripherals. The kernel's relationship to a GPU is roughly:

Linux kernel:
  - Maps PCI BARs (MMIO)
  - Sets up IOMMU
  - Routes interrupts
  - That's it. The driver owns everything else.

GPU driver (e.g., nvidia.ko):
  - Owns the command queue
  - Owns memory management (VRAM allocation, page tables)
  - Owns scheduling (which process runs on which compute unit)
  - Owns isolation (who can see whose memory)
  - Owns power management (GPU clock, thermal)
  - The kernel knows nothing about any of this.

This is analogous to how operating systems managed CPUs before preemptive multitasking: the hardware vendor's runtime owns the resource, and the kernel is blind.

42.1.2 What This Causes

| Problem | Impact |
|---------|--------|
| No kernel-visible GPU scheduling | Two containers sharing a GPU have no fairness guarantees. One can starve the other. |
| No cgroup integration | cpu.max limits CPU time. Nothing limits GPU time. Kubernetes has no way to enforce GPU QoS. |
| No memory accounting | GPU VRAM usage is invisible to the OOM killer. A GPU memory leak crashes the GPU driver, not just the leaking process. |
| No preemption | A long-running GPU kernel (training batch) blocks interactive inference. Linux has solved this for CPUs since 1991. |
| No unified memory management | CUDA UVM, AMD SVM, Intel SVM: each vendor's "unified memory" is driver-internal. The kernel's page fault handler doesn't know about GPU page tables. |
| No isolation | GPU memory between processes is isolated by the driver, not the kernel. Driver bugs become security holes. |
| Driver crash = system crash | An NVIDIA driver hang requires a full system reboot. No crash recovery. |
| No P2P DMA management | GPUDirect is vendor-specific. There is no kernel facility for "DMA from NVMe directly to GPU VRAM." |

42.1.3 The Opportunity

A kernel designed from scratch in 2026 can treat accelerators as first-class scheduled resources, the same way it treats CPUs, memory, and I/O bandwidth. This is not about running PyTorch in the kernel — it is about the kernel managing accelerator hardware the way it manages all other hardware: scheduling, memory, isolation, recovery.

ISLE's existing architecture is uniquely suited for this:

  - KABI: Stable driver ABI means GPU driver updates are decoupled from kernel updates
  - Crash recovery: GPU driver crashes can be survived (driver binary reload in ~50-150ms; total recovery including GPU hardware reset: ~100-500ms)
  - Capability system: Natural fit for fine-grained accelerator access control
  - Device registry: Models accelerator topology (GPU → engines, VRAM, display, encode)
  - cgroup integration: CPU bandwidth guarantees (Section 15) extend naturally to accelerator time
  - Zero-copy I/O paths: Generalize to device-to-device DMA


42.2 Unified Accelerator Framework

42.2.1 Design: isle-accel

A new KABI interface family for accelerator devices. Every accelerator driver — GPU, NPU, TPU, FPGA, DSP, custom ASIC — implements the same base interface. Hardware-specific capabilities are exposed through versioned extension vtables.

                         isle-accel (KABI interface)
                                    |
                    +---------------+---------------+
                    |               |               |
               AccelBase       AccelCompute    AccelDisplay
               (all accel)     (compute-      (display/render
                               capable)        capable)
                    |               |               |
               nvidia GPU      nvidia GPU      nvidia GPU
               amdgpu GPU      amdgpu GPU      amdgpu GPU
               xe GPU          intel_npu
               intel_npu       custom_tpu
               custom_tpu

42.2.2 AccelBaseVTable — The Universal Interface

Every accelerator driver provides this vtable. It covers what the kernel needs to manage the device, not what userspace needs to program it.

/// Base vtable that every accelerator driver must implement.
/// This is the kernel-management interface, not the compute API.
#[repr(C)]
pub struct AccelBaseVTable {
    /// Total size of this vtable struct in bytes. Allows the kernel to detect
    /// when a driver provides a newer (larger) vtable than expected and safely
    /// ignore trailing fields it does not understand.
    pub vtable_size: u64,

    /// KABI version encoded as (major << 16) | minor.
    /// - **Major version** (upper 16 bits): incremented for breaking changes
    ///   (removed fields, changed semantics, reordered vtable entries).
    ///   Kernel and driver must agree on the major version or registration fails.
    /// - **Minor version** (lower 16 bits): incremented for backward-compatible
    ///   additions (new Optional fields appended to the vtable, new enum variants).
    ///   A driver with minor version N works with a kernel expecting minor <= N.
    ///
    /// **Negotiation protocol during driver registration:**
    /// 1. Driver calls `register_driver()` with its vtable (including `version`
    ///    and `vtable_size`).
    /// 2. Kernel extracts the major version: `driver_major = version >> 16`.
    /// 3. If `driver_major != kernel_major`, registration fails with
    ///    `KABI_VERSION_MISMATCH`. The driver must be rebuilt against a
    ///    compatible kernel header.
    /// 4. If `driver_major == kernel_major`, the kernel accepts the vtable.
    ///    Any `Option<fn>` fields beyond the driver's `vtable_size` are treated
    ///    as `None` (not present). Any required (non-Option) fields beyond the
    ///    driver's `vtable_size` cause registration failure.
    /// 5. The negotiated version is recorded in the device registry node for
    ///    diagnostic purposes.
    ///
    /// This protocol applies identically to AccelComputeVTable,
    /// AccelDisplayVTable, RdmaDeviceVTable, and all other KABI vtables
    /// in this section.
    pub version: u32,

    // === Device Info ===

    /// Return device capabilities and resource inventory.
    pub get_info: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_info: *mut AccelDeviceInfo,
    ) -> IoResultCode,

    // === Context Management ===

    /// Create an execution context (one per process/tenant).
    /// Returns an opaque context handle.
    pub create_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        owner_pid: u32,
        priority: AccelPriority,
        limits: *const AccelContextLimits,
        out_context: *mut AccelContextHandle,
    ) -> IoResultCode,

    /// Destroy an execution context and free all its resources.
    pub destroy_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
    ) -> IoResultCode,

    // === Command Submission ===

    /// Submit a command buffer for execution.
    /// The kernel calls this after validating capabilities and
    /// scheduling the submission according to policy.
    pub submit_commands: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        cmd_buffer: *const u8,
        cmd_len: u32,
        fences: *const AccelFence,
        fence_count: u32,
        out_submission: *mut AccelSubmissionHandle,
    ) -> IoResultCode,

    /// Poll for completion of a submitted command buffer.
    pub poll_completion: unsafe extern "C" fn(
        ctx: *mut c_void,
        submission: AccelSubmissionHandle,
        out_status: *mut AccelCompletionStatus,
    ) -> IoResultCode,

    /// Request preemption or cooperative yield of a running context.
    ///
    /// Behavior depends on `AccelDeviceInfo::preemption_granularity`:
    /// - `Instruction` or `DrawDispatch`: True preemption. The device saves
    ///   context state and stops the running workload within ~50μs-10ms
    ///   (depends on preemption granularity and workload; instruction-level
    ///   preemption on modern compute GPUs is typically 50-100μs, while
    ///   draw/dispatch-level preemption can take milliseconds).
    ///   The context can be resumed later via a new `submit_commands`.
    /// - `CommandBuffer`: The device finishes the current command buffer
    ///   but does not start the next queued one. Latency is bounded by
    ///   the longest in-flight command buffer.
    /// - `None`: Cooperative yield only. The driver stops submitting new
    ///   command buffers after the current one completes. Cannot interrupt
    ///   a running dispatch.
    ///
    /// Returns `IO_OK` if the preemption/yield was initiated (completion
    /// is asynchronous — the scheduler polls for the context to become
    /// idle). Returns `IO_NOT_SUPPORTED` if the driver does not implement
    /// any form of preemption or yield.
    pub preempt_context: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        reason: PreemptReason,
    ) -> IoResultCode>,

    // === Memory Management ===

    /// Allocate device-local memory (VRAM, local SRAM, etc.).
    pub alloc_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        size: u64,
        alignment: u64,
        page_size: AccelPageSize,
        flags: AccelMemFlags,
        out_handle: *mut AccelMemHandle,
    ) -> IoResultCode,

    /// Free device-local memory.
    pub free_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        handle: AccelMemHandle,
    ) -> IoResultCode,

    /// Map device memory into a CPU-visible virtual address.
    pub map_device_memory: unsafe extern "C" fn(
        ctx: *mut c_void,
        handle: AccelMemHandle,
        offset: u64,
        size: u64,
        out_cpu_addr: *mut u64,
    ) -> IoResultCode,

    /// Migrate pages between CPU RAM and device memory.
    /// Direction determined by flags.
    pub migrate_pages: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        pages: *const AccelMigrationEntry,
        page_count: u32,
        flags: MigrationFlags,
    ) -> IoResultCode>,

    // === Utilization Reporting ===

    /// Report current device utilization to the kernel scheduler.
    pub get_utilization: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_util: *mut AccelUtilization,
    ) -> IoResultCode,

    // === Power/Thermal ===

    /// Get current power and thermal state.
    pub get_power_state: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_state: *mut AccelPowerState,
    ) -> IoResultCode,

    /// Set performance level / clock frequency.
    pub set_performance_level: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        level: AccelPerfLevel,
    ) -> IoResultCode>,

    // === Reset ===

    /// Reset a single execution context (not the whole device).
    pub reset_context: unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
    ) -> IoResultCode,

    /// Full device reset.
    pub reset_device: unsafe extern "C" fn(
        ctx: *mut c_void,
    ) -> IoResultCode,

    // === Completion Notification ===

    /// Register a kernel callback for command completion notification.
    /// Replaces polling for latency-sensitive contexts.
    pub register_completion_callback: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        callback: unsafe extern "C" fn(
            context: AccelContextHandle,
            submission: AccelSubmissionHandle,
            status: AccelCompletionStatus,
        ),
    ) -> IoResultCode>,
}
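The negotiation steps documented on `version` can be sketched as a standalone check. This is a minimal model, not kernel code: `negotiate`, `RegisterError`, and `required_min_size` (the offset just past the last required, non-`Option` vtable field the kernel knows about) are illustrative names.

```rust
/// Illustrative model of the registration-time KABI version check.
#[derive(Debug, PartialEq)]
pub enum RegisterError {
    KabiVersionMismatch { kernel_major: u16, driver_major: u16 },
    MissingRequiredFields,
}

pub fn negotiate(
    kernel_major: u16,
    required_min_size: u64,
    driver_version: u32,
    driver_vtable_size: u64,
) -> Result<(u16, u16), RegisterError> {
    // Step 2: extract the driver's major/minor from the packed version field.
    let driver_major = (driver_version >> 16) as u16;
    let driver_minor = (driver_version & 0xFFFF) as u16;

    // Step 3: majors must match exactly, or registration fails.
    if driver_major != kernel_major {
        return Err(RegisterError::KabiVersionMismatch { kernel_major, driver_major });
    }

    // Step 4: every required (non-Option) field must lie within the driver's
    // vtable; Option fields beyond vtable_size are simply treated as None.
    if driver_vtable_size < required_min_size {
        return Err(RegisterError::MissingRequiredFields);
    }

    // Step 5: the negotiated (major, minor) is recorded in the device registry.
    Ok((driver_major, driver_minor))
}
```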

Callback restrictions: Completion callbacks execute in interrupt context (hardirq context on x86, IRQ context on ARM). They MUST NOT:

  - Allocate memory (no Box, Vec, or slab allocation)
  - Acquire sleeping locks (no Mutex; only SpinLock with try_lock)
  - Perform I/O (no disk, network, or MMIO beyond the accelerator's own registers)
  - Call schedule() or any function that may sleep

Callbacks that need to perform complex work (e.g., chaining dependent dispatches, updating shared data structures) must defer to a per-accelerator workqueue thread via accel_defer(closure). The workqueue runs in process context with full kernel capabilities. The completion callback's job is limited to: (1) recording completion status, (2) waking the workqueue if deferred work is pending, and (3) updating per-context statistics (fence value, timing).

42.2.3 Key Data Types

/// Device capability and resource inventory.
#[repr(C)]
pub struct AccelDeviceInfo {
    /// Device class.
    pub device_class: AccelDeviceClass,

    /// Number of hardware compute units in this device.
    /// The definition of "compute unit" is device-class-specific:
    ///   - GPU: Streaming Multiprocessors (SMs) for NVIDIA, Compute Units (CUs) for AMD
    ///   - NPU: neural processing cores
    ///   - FPGA: configurable logic blocks
    ///   - DSP: DSP cores
    /// This is the count reported by the driver via `get_info()` and used by the
    /// AccelScheduler for proportional resource accounting. For MIG/partitioned
    /// devices, this is the compute units assigned to the partition, not the
    /// full device total.
    pub compute_units: u32,

    /// Device-local memory size (bytes). 0 if no local memory.
    pub local_memory_bytes: u64,

    /// Maximum concurrent execution contexts.
    pub max_contexts: u32,

    /// Hardware preemption support.
    pub preemption_granularity: PreemptionGranularity,

    /// Maximum command buffer size (bytes).
    pub max_cmd_buffer_size: u32,

    /// Supported memory types.
    pub memory_types: AccelMemTypeFlags,

    /// PCIe atomics support (needed for fine-grained SVM).
    /// Uses `u8` (0=false, 1=true) instead of `bool` for stable KABI:
    /// C `bool` has implementation-defined size and alignment, making it
    /// unsuitable for cross-compilation-unit ABI boundaries.
    pub pcie_atomics: u8,

    /// Peer-to-peer DMA capability.
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub p2p_capable: u8,

    /// Unified virtual addressing (CPU and device share address space).
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub unified_addressing: u8,

    /// Hardware page fault support (device can fault on unmapped pages).
    /// `u8` for KABI stability (see `pcie_atomics` comment).
    pub hw_page_faults: u8,

    pub _pad: [u8; 32],
}

#[repr(u32)]
pub enum AccelDeviceClass {
    /// General-purpose GPU (compute + graphics + display).
    Gpu             = 0,
    /// Compute-only GPU (no display, e.g., datacenter SKUs).
    GpuCompute      = 1,
    /// Neural Processing Unit (fixed-function inference).
    Npu             = 2,
    /// Tensor Processing Unit / AI ASIC.
    Tpu             = 3,
    /// FPGA with compute overlay.
    Fpga            = 4,
    /// Digital Signal Processor.
    Dsp             = 5,
    /// Media processor (video encode/decode).
    MediaProcessor  = 6,
    /// Computational storage device with embedded compute capability.
    ComputeStorage  = 7,
    /// Other / vendor-specific.
    Other           = 255,
}

#[repr(u32)]
pub enum PreemptionGranularity {
    /// No preemption support. Context runs to completion.
    None            = 0,
    /// Preempt at command buffer boundaries.
    CommandBuffer   = 1,
    /// Preempt at draw call / dispatch boundaries.
    DrawDispatch    = 2,
    /// Preempt at instruction level (mid-shader/kernel).
    Instruction     = 3,
}

/// Reason for requesting preemption of a running accelerator context.
/// Passed to `preempt_context` so the driver can log/report the cause
/// and, on devices that support it, choose an appropriate preemption
/// strategy (e.g., urgent drain vs. graceful yield).
#[repr(u32)]
pub enum PreemptReason {
    /// A higher-priority submission is waiting behind this context.
    /// The scheduler detected priority inversion and is preempting the
    /// lower-priority context to unblock the higher-priority one.
    PriorityInversion = 0,
    /// The context has exceeded its CBS bandwidth budget for the current
    /// scheduling period. Fairness enforcement requires yielding the
    /// device so other contexts can use their guaranteed bandwidth.
    FairnessTimeout   = 1,
    /// The context's current submission has exceeded its
    /// `AccelContextLimits::max_execution_us` timeout. The scheduler is
    /// preempting to enforce the per-submission execution time limit.
    ExecutionTimeout  = 2,
    /// Administrative eviction: the context is being forcibly removed
    /// from the device. Causes include driver unload, device reset,
    /// cgroup removal, or process exit cleanup.
    AdminEvict        = 3,
}

/// Per-context resource limits (enforced by kernel).
#[repr(C)]
pub struct AccelContextLimits {
    /// Maximum device memory this context can allocate (bytes).
    /// 0 = no limit (subject to device capacity).
    pub max_memory_bytes: u64,

    /// Maximum compute time per submission (microseconds).
    /// 0 = no limit. Submissions exceeding this are preempted.
    pub max_execution_us: u64,

    /// Bandwidth guarantee: guaranteed microseconds of compute
    /// per scheduling period. See Section 44 (cgroup integration).
    /// 0 = best-effort (no guarantee).
    pub guaranteed_bandwidth_us: u64,

    /// Scheduling period for bandwidth accounting (microseconds).
    /// Default: 1_000_000 (1 second).
    pub bandwidth_period_us: u64,

    pub _pad: [u8; 32],
}

/// Scheduling priority for accelerator contexts.
#[repr(u32)]
pub enum AccelPriority {
    /// Background / batch. Lowest priority. Preempted by all others.
    Background  = 0,
    /// Normal interactive workload.
    Normal      = 1,
    /// High priority (e.g., real-time inference with SLO).
    High        = 2,
    /// Realtime. Highest priority. Preempts all others immediately.
    Realtime    = 3,
}

/// Entry describing a page to migrate between CPU and device memory.
/// Used by migrate_pages() in AccelBaseVTable.
#[repr(C)]
pub struct AccelMigrationEntry {
    /// Virtual address of the page to migrate (must be page-aligned).
    pub vaddr: u64,
    /// Current location: 0 = CPU RAM, 1 = device memory.
    pub current_location: u8,
    /// Desired location after migration.
    pub target_location: u8,
    /// Result of migration (filled by driver): 0 = success, errno on failure.
    pub result: i16,
    pub _pad: [u8; 4],
}

bitflags! {
    /// Flags controlling page migration behavior.
    #[repr(C)]
    pub struct MigrationFlags: u32 {
        /// Migrate from CPU RAM to device memory.
        const TO_DEVICE   = 1 << 0;
        /// Migrate from device memory to CPU RAM.
        const TO_CPU      = 1 << 1;
        /// Force migration even if page is hot/pinned.
        const FORCE       = 1 << 2;
        /// Asynchronous migration (return immediately, complete via callback).
        const ASYNC       = 1 << 3;
        /// Prefetch hint (migrate pages likely to be accessed soon).
        const PREFETCH    = 1 << 4;
    }
}
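As an illustration of how a `migrate_pages()` request is assembled, the sketch below builds one `AccelMigrationEntry` per 4 KiB page with a prefetch hint. The `build_prefetch` helper and the plain-`u32` flag constants are illustrative, not part of the KABI (the real flags are the `MigrationFlags` bitflags above), and the 4 KiB page size is an assumption.

```rust
// Plain-u32 mirrors of the relevant MigrationFlags bits (illustrative).
const TO_DEVICE: u32 = 1 << 0;
const PREFETCH: u32 = 1 << 4;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct AccelMigrationEntry {
    pub vaddr: u64,           // page-aligned virtual address
    pub current_location: u8, // 0 = CPU RAM, 1 = device memory
    pub target_location: u8,
    pub result: i16,          // filled by the driver: 0 = success
    pub _pad: [u8; 4],
}

/// Build prefetch entries for `count` contiguous 4 KiB pages starting
/// at `base`, migrating CPU RAM -> device memory.
pub fn build_prefetch(base: u64, count: usize) -> (Vec<AccelMigrationEntry>, u32) {
    assert_eq!(base % 4096, 0, "vaddr must be page-aligned");
    let entries = (0..count)
        .map(|i| AccelMigrationEntry {
            vaddr: base + (i as u64) * 4096,
            current_location: 0, // CPU RAM
            target_location: 1,  // device memory
            result: 0,
            _pad: [0; 4],
        })
        .collect();
    (entries, TO_DEVICE | PREFETCH)
}
```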

/// Device utilization report.
#[repr(C)]
pub struct AccelUtilization {
    /// Compute utilization 0-100%.
    pub compute_percent: u32,
    /// Memory bandwidth utilization 0-100%.
    pub memory_bw_percent: u32,
    /// Memory usage (bytes allocated).
    pub memory_used_bytes: u64,
    /// Memory total (bytes).
    pub memory_total_bytes: u64,
    /// Number of active contexts.
    pub active_contexts: u32,
    /// Current temperature (millidegrees Celsius).
    pub temperature_mc: u32,
    /// Current power draw (milliwatts).
    pub power_mw: u32,
    /// Current clock frequency (MHz).
    pub clock_mhz: u32,
}
// Opaque handle types (all u64 newtypes, #[repr(transparent)] for
// zero-cost FFI — the newtype has the same ABI as the inner u64).
#[repr(transparent)]
pub struct AccelContextHandle(pub u64);
#[repr(transparent)]
pub struct AccelMemHandle(pub u64);
#[repr(transparent)]
pub struct AccelSubmissionHandle(pub u64);
#[repr(transparent)]
pub struct P2pMappingHandle(pub u64);
/// Fence for synchronizing command submissions.
///
/// The (device_id, context_id, value) tuple uniquely identifies a fence point.
/// Cross-device fence signaling requires both devices to be registered in the
/// same AccelFenceRegistry (Section 42.2.4).
#[repr(C)]
pub struct AccelFence {
    pub fence_type: AccelFenceType,
    pub value: u64,
    pub device_id: DeviceNodeId,  // Owning device (prevents cross-device forgery)
    pub context_id: u32,          // Owning context within the device
}

#[repr(u32)]
pub enum AccelFenceType {
    /// Device-local fence (GPU timeline semaphore).
    DeviceLocal  = 0,
    /// Cross-device fence (for P2P synchronization).
    CrossDevice  = 1,
    /// CPU-signalable fence (host-side event).
    CpuSignal    = 2,
}

#[repr(u32)]
pub enum AccelCompletionStatus {
    /// Command completed successfully.
    Success       = 0,
    /// Command failed with device error.
    Error         = 1,
    /// Command timed out (exceeded max_execution_us).
    Timeout       = 2,
    /// Command was preempted by higher-priority context.
    Preempted     = 3,
    /// Partial error: some work completed, some failed.
    PartialError  = 4,
}

/// Registry for cross-device fence synchronization.
/// Devices must be registered in the same AccelFenceRegistry to share
/// cross-device fences. The registry is kernel-internal; userspace
/// interacts through /dev/isle-accel-N ioctls.
///
/// There is one AccelFenceRegistry per accelerator topology domain
/// (typically one per NUMA node or PCIe domain). Devices in different
/// domains cannot share cross-device fences.
pub struct AccelFenceRegistry {
    /// Registered devices (device_id -> fence_protocol_support).
    devices: spin::RwLock<BTreeMap<DeviceNodeId, FenceProtocolSupport>>,

    /// Active cross-device fences, keyed by (device_id, context_id, value).
    /// The fence's signaling status is atomically updated by the device
    /// that owns the fence. Devices polling on a foreign fence read from
    /// this table.
    fences: spin::RwLock<BTreeMap<(DeviceNodeId, u32, u64), AtomicBool>>,

    /// Waiters for fence signaling (device_id -> list of waiters).
    /// When a fence signals, all registered waiters are notified via
    /// their registered callback.
    waiters: spin::RwLock<BTreeMap<DeviceNodeId, Vec<FenceWaiterEntry>>>,
}

/// Fence protocol capabilities for a registered device.
/// Boolean fields use `u8` (0=false, 1=true) for stable KABI, per the
/// rationale in `AccelDeviceInfo::pcie_atomics`.
#[repr(C)]
pub struct FenceProtocolSupport {
    /// Device supports timeline semaphores (monotonically increasing value).
    pub timeline_semaphores: u8,
    /// Device supports cross-device signaling via hardware sync objects.
    pub hw_cross_device: u8,
    /// Device supports CPU-signaling of device fences (for host wait).
    pub cpu_signal: u8,
    pub _pad: [u8; 5],
    /// Maximum fence value before wrap (0 = no limit).
    pub max_fence_value: u64,
}

/// Entry in the fence waiter table.
struct FenceWaiterEntry {
    /// Fence being waited on.
    fence: AccelFence,
    /// Device waiting for the fence.
    waiter_device: DeviceNodeId,
    /// Callback to invoke when fence signals (driver-provided).
    callback: unsafe extern "C" fn(device_id: DeviceNodeId, fence: AccelFence),
}
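The signal path through the fence table can be sketched single-threaded, with locking, the waiter callbacks, and `FenceProtocolSupport` checks omitted. The timeline interpretation shown here, where a signaled value N also satisfies waits on any M <= N for the same (device, context), is one plausible reading of the timeline-semaphore support above, not a protocol the section spells out.

```rust
use std::collections::BTreeMap;

type DeviceNodeId = u32;
/// (device_id, context_id, fence value) — the unique fence-point key.
type FenceKey = (DeviceNodeId, u32, u64);

#[derive(Default)]
pub struct FenceTable {
    fences: BTreeMap<FenceKey, bool>, // true = signaled
}

impl FenceTable {
    /// Owning device creates the fence point (initially unsignaled).
    pub fn create(&mut self, key: FenceKey) {
        self.fences.insert(key, false);
    }

    /// Owning device signals the fence on completion.
    pub fn signal(&mut self, key: FenceKey) {
        if let Some(s) = self.fences.get_mut(&key) {
            *s = true;
        }
    }

    /// A foreign device polls the fence. With timeline semantics, a
    /// signaled fence at value N also covers any wait for M <= N on the
    /// same (device, context), so we scan keys at or above `value`.
    pub fn is_signaled(&self, (dev, ctx, value): FenceKey) -> bool {
        self.fences
            .range((dev, ctx, value)..=(dev, ctx, u64::MAX))
            .any(|(_, &signaled)| signaled)
    }
}
```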

42.2.4 Kernel-Side Accelerator Scheduler

The kernel maintains an accelerator scheduler that sits between userspace submissions and the driver's submit_commands. This is the core innovation — the kernel sees and controls accelerator time.

// isle-core/src/accel/scheduler.rs (kernel-internal)

/// Per-accelerator scheduler.
pub struct AccelScheduler {
    /// Device this scheduler manages.
    device_id: DeviceNodeId,

    /// Device capabilities (cached from get_info).
    device_info: AccelDeviceInfo,

    /// Active contexts, ordered by priority and deadline.
    /// Uses a fixed-capacity sorted array (no heap allocation).
    /// Capacity is set to `device_info.max_contexts` at initialization.
    contexts: FixedSortedArray<AccelContextHandle, AccelContextState>,

    /// Per-context CBS bandwidth servers (same algorithm as CPU scheduler,
    /// see Section 15). Pre-allocated at initialization with capacity
    /// equal to `device_info.max_contexts`.
    bandwidth_servers: FixedVec<AccelCbsServer>,

    /// Hardware command queue depth (how many submissions are in-flight).
    hw_queue_depth: AtomicU32,

    /// Maximum concurrent in-flight submissions.
    max_inflight: u32,

    /// Scheduling policy.
    policy: AccelSchedPolicy,
}

pub struct AccelContextState {
    /// Owner process/cgroup.
    owner_pid: u32,
    cgroup_id: u64,

    /// Priority class.
    priority: AccelPriority,

    /// Resource limits.
    limits: AccelContextLimits,

    /// Index into `AccelScheduler::bandwidth_servers` (if guaranteed
    /// bandwidth is set). The CBS server state is owned by the scheduler's
    /// `bandwidth_servers` array — storing it here as well would create
    /// a consistency hazard (two copies of the same server state). The
    /// scheduler uses this index to look up the context's CBS server
    /// during scheduling decisions.
    cbs_server_index: Option<u32>,

    /// Pending command buffers waiting to be submitted to hardware.
    /// Fixed-capacity ring buffer, pre-allocated at context creation.
    pending_queue: FixedRingBuffer<PendingSubmission>,

    /// In-flight submissions (on hardware).
    /// Fixed-capacity array, sized to `max_inflight` at initialization.
    inflight: FixedVec<AccelSubmissionHandle>,

    /// Accounting: total compute time consumed (nanoseconds).
    total_compute_ns: AtomicU64,

    /// Accounting: total memory allocated (bytes).
    total_memory_bytes: AtomicU64,
}

/// Pending submission waiting to be dispatched to hardware.
/// Stored in the context's `pending_queue` until the scheduler submits it.
#[repr(C)]
pub struct PendingSubmission {
    /// Handle to the command buffer (driver-allocated).
    pub cmd_buffer: AccelCmdBufferHandle,

    /// Submission ID assigned by the driver for tracking.
    pub driver_submit_id: u64,

    /// Timestamp when submission was queued (for timeout detection).
    pub queued_timestamp_ns: u64,

    /// Number of fences this submission depends on (0 = immediate submit).
    pub dependency_count: u32,

    /// Semaphore to signal on completion (optional).
    pub completion_semaphore: Option<AccelSemaphoreHandle>,
}

/// Kernel-internal record for a submission in flight on hardware,
/// correlated (by driver_submit_id) with the handles in the context's
/// `inflight` array until completion. Distinct from the opaque
/// `AccelSubmissionHandle` newtype that crosses the KABI boundary
/// (Section 42.2.3).
#[repr(C)]
pub struct InflightSubmission {
    /// Driver-assigned submission ID (matches PendingSubmission::driver_submit_id).
    pub driver_submit_id: u64,

    /// Timestamp when submitted to hardware (for timeout detection).
    pub submitted_timestamp_ns: u64,

    /// Fence that will be signaled on completion.
    pub completion_fence: AccelFence,
}

#[repr(u32)]
pub enum AccelSchedPolicy {
    /// Simple round-robin between contexts. Default for NPU/FPGA.
    RoundRobin  = 0,
    /// Priority-based with preemption. Default for GPU.
    Priority    = 1,
    /// CBS bandwidth guarantee + priority. Default when cgroup limits set.
    Guaranteed  = 2,
}

/// CBS (Constant Bandwidth Server) state for per-context bandwidth enforcement.
/// Uses the same algorithm as the CPU scheduler (Section 15), adapted for
/// accelerator time. Each context with a bandwidth guarantee gets its own
/// AccelCbsServer instance in the scheduler's bandwidth_servers array.
#[repr(C)]
pub struct AccelCbsServer {
    /// Context this server is associated with.
    pub context_id: AccelContextHandle,

    /// Maximum bandwidth (nanoseconds of accelerator time per period).
    pub bandwidth_ns: u64,

    /// Period in nanoseconds (e.g., 10ms = 10_000_000).
    pub period_ns: u64,

    /// Current runtime consumed in this period (nanoseconds).
    pub runtime_consumed: AtomicU64,

    /// Absolute deadline (nanoseconds since boot). Computed as
    /// period_start + period_ns when the context is activated.
    pub deadline: AtomicU64,

    /// Start of the current period (nanoseconds since boot).
    pub period_start: AtomicU64,
}

Allocation discipline: All scheduler data structures use pre-allocated, fixed-capacity storage. No heap allocation occurs on the scheduling fast path. FixedSortedArray, FixedVec, and FixedRingBuffer are kernel-internal types that allocate their backing storage once at initialization (when the device is registered or a context is created) and never resize. This ensures that submit_commands, poll_completion, and context scheduling are deterministic-latency operations with no allocator contention.
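A minimal sketch of the CBS accounting implied by the AccelCbsServer fields above: charge consumed accelerator time against the budget, and replenish when a new period starts. Plain `u64` fields stand in for the atomics, and the replenishment rule shown (snapping `period_start` to the period grid) is an illustrative simplification of the CPU scheduler's CBS algorithm, not its exact form.

```rust
/// Simplified CBS server state (atomics replaced by plain u64).
pub struct CbsState {
    pub bandwidth_ns: u64,     // budget per period
    pub period_ns: u64,
    pub runtime_consumed: u64, // consumed in the current period
    pub period_start: u64,     // ns since boot
    pub deadline: u64,         // period_start + period_ns
}

impl CbsState {
    /// Charge `delta_ns` of accelerator time at time `now_ns`. Returns
    /// true if the context is still within its guarantee; false means it
    /// should be throttled or preempted (PreemptReason::FairnessTimeout).
    pub fn charge(&mut self, now_ns: u64, delta_ns: u64) -> bool {
        // New period: snap period_start forward, push the deadline,
        // and replenish the budget.
        if now_ns >= self.period_start + self.period_ns {
            self.period_start = now_ns - (now_ns - self.period_start) % self.period_ns;
            self.deadline = self.period_start + self.period_ns;
            self.runtime_consumed = 0;
        }
        self.runtime_consumed += delta_ns;
        self.runtime_consumed <= self.bandwidth_ns
    }
}
```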

Scheduling flow:

Userspace submits work (via /dev/isle-accel-N or DRM ioctl compat):
    |
    v
ISLE Core validates capabilities
  - Does this process have an AccelContext for this device?
  - Does this process's cgroup allow more accelerator time?
  - Does this context have memory budget for the command buffer?
    |
    v
AccelScheduler queues the submission
  - Assigns priority based on context priority + cgroup policy
  - Checks CBS bandwidth server: is this context within its guarantee?
    |
    v
AccelScheduler picks next submission to dispatch to hardware
  - Priority order: Realtime > High > Normal > Background
  - Within same priority: CBS server with earliest deadline first
  - If higher-priority work arrives and device supports preemption:
    preempt current context via driver's preempt_context()
    |
    v
Driver's submit_commands() sends work to hardware
    |
    v
Hardware completion interrupt
    |
    v
Driver's poll_completion() reports result
    |
    v
AccelScheduler updates accounting (compute time, memory)
    |
    v
Userspace receives completion notification
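The dispatch order described in the flow above (strict priority classes first, then earliest CBS deadline within a class) can be sketched with simplified stand-in types; `Candidate` and `pick_next` are illustrative, not the scheduler's actual structures.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub enum AccelPriority { Background = 0, Normal = 1, High = 2, Realtime = 3 }

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Candidate {
    pub context_id: u64,
    pub priority: AccelPriority,
    /// Absolute CBS deadline in ns since boot (u64::MAX = no CBS server,
    /// i.e., best-effort: loses deadline ties to any guaranteed context).
    pub deadline_ns: u64,
}

/// Pick the next submission to dispatch from the runnable candidates:
/// Realtime > High > Normal > Background, then earliest deadline first
/// within the same priority class.
pub fn pick_next(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        // max_by_key on (priority, Reverse(deadline)): highest priority
        // wins, then the earliest deadline within that priority class.
        .max_by_key(|c| (c.priority, std::cmp::Reverse(c.deadline_ns)))
}
```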

AccelScheduler vs Firmware Scheduling Boundary:

Modern GPUs have their own internal schedulers (e.g., NVIDIA's GPC/TPC scheduler, AMD's ACE/HWS). The AccelScheduler does NOT replace or duplicate firmware scheduling. The boundary is clear:

AccelScheduler (kernel, software):
  Controls WHICH contexts get to submit work and WHEN.
  Enforces fairness, cgroup limits, bandwidth guarantees, priorities.
  Decides the ORDER of submissions to the hardware queue.
  Granularity: per-submission (command buffer level).

Firmware scheduler (device, hardware/firmware):
  Controls HOW submitted work is mapped to hardware execution units.
  Distributes warps/wavefronts across SMs/CUs.
  Manages context switching on the device.
  Handles workgroup scheduling and occupancy.
  Granularity: per-instruction-group (warp/wavefront level).

Interaction:
  AccelScheduler submits command buffer → firmware takes over.
  The kernel does not see or control internal GPU scheduling.
  The kernel CAN preempt at context boundaries (via preempt_context()),
  but CANNOT preempt mid-kernel on hardware that doesn't support it.
  Preemption capability is reported by get_info().preemption_granularity.
  Devices with PreemptionGranularity::None do not support preemption;
  the scheduler uses cooperative yield for those devices.

  When the device has hardware preemption (NVIDIA compute preemption,
  AMD MES): AccelScheduler can request preemption, and the firmware
  will drain the current workgroup and save context. Modern GPUs
  (Ampere+, CDNA2+) complete preemption in ~50μs-10ms including
  context save/restore, depending on preemption granularity and
  in-flight workload (instruction-level compute preemption is
  typically 50-100μs; draw/dispatch-level can be milliseconds).
  This is expensive compared to CPU context switches (~1μs) but
  bounded. The AccelScheduler therefore avoids preemption
  on the fast path (submissions are ordered by priority before dispatch)
  and triggers preemption only when necessary: priority inversion
  (a Realtime context is blocked behind a Background dispatch),
  fairness enforcement (a context has exceeded its CBS bandwidth
  budget), or timeout (max_execution_us exceeded).

  Older GPUs (pre-Ampere NVIDIA, pre-CDNA2 AMD) may report
  PreemptionGranularity::CommandBuffer or ::None, where preemption
  latency is bounded by the longest in-flight command buffer
  (potentially hundreds of milliseconds for large compute dispatches).
  The scheduler treats these as cooperative-yield devices (see below).

Non-preemptible GPU limitation and cooperative yield:

Many GPUs (older hardware, NPUs, FPGAs, and some mid-range consumer GPUs) report PreemptionGranularity::None. On these devices, the AccelScheduler CANNOT interrupt a running workload mid-dispatch. The scheduler can only enforce time slicing at submission boundaries — between command buffer submissions — not within a single dispatch. This is coarse-grained time slicing.

Consequences for non-preemptible devices:

  • max_execution_us is enforced at submission granularity, not instruction granularity. A single long-running dispatch that exceeds max_execution_us cannot be aborted; the scheduler must wait for it to complete naturally before taking corrective action.
  • Time guarantees for other contexts degrade proportionally to the longest single dispatch in the queue. Workloads with large dispatches should set small max_cmd_buffer_size limits (enforced by the kernel at submission time) to bound worst-case dispatch duration.

Cooperative yield mechanism for non-preemptible devices:

  • The driver implements preempt_context() as a cooperative yield: after the current command buffer completes, do not submit the next one, and signal the scheduler. This is NOT true preemption — the driver cannot interrupt the GPU mid-dispatch. It stops feeding new work at the next natural command buffer boundary.
  • The preempt_context vtable entry serves double duty: on preemptible devices it triggers true hardware preemption (~50μs-10ms); on non-preemptible devices it triggers cooperative yield (latency bounded by longest in-flight command buffer). The scheduler checks preemption_granularity to know which behavior to expect.
  • Drivers for non-preemptible devices MUST report PreemptionGranularity::None in get_info(). Misreporting this field is a driver correctness violation.
  • The AccelScheduler logs a warning when a non-preemptible context exceeds max_execution_us, records the overrun duration in AccelContextState::total_compute_ns, and applies backpressure (delayed next submission) to amortize the overrun across the cgroup's next scheduling period.
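The double-duty behavior of preempt_context() reduces to a dispatch on the granularity the driver reported. A minimal sketch — the document names the None and CommandBuffer variants of PreemptionGranularity; the finer-grained variants and the PreemptAction type are assumptions for illustration:

```rust
// How the scheduler might interpret a preempt_context() request based on
// the preemption granularity reported by get_info(). Variant names beyond
// None/CommandBuffer, and PreemptAction itself, are illustrative.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum PreemptionGranularity { None, CommandBuffer, Dispatch, Instruction }

#[derive(Debug, PartialEq)]
pub enum PreemptAction {
    /// True hardware preemption: interrupt the running work now.
    HardwarePreempt,
    /// Cooperative yield: stop feeding work at the next command buffer boundary.
    CooperativeYield,
}

pub fn preempt_action(granularity: PreemptionGranularity) -> PreemptAction {
    match granularity {
        // Fine-grained hardware preemption is available (e.g., Ampere+ compute).
        PreemptionGranularity::Dispatch | PreemptionGranularity::Instruction => {
            PreemptAction::HardwarePreempt
        }
        // No mid-dispatch interruption possible: yield at the next boundary,
        // with latency bounded by the longest in-flight command buffer.
        PreemptionGranularity::None | PreemptionGranularity::CommandBuffer => {
            PreemptAction::CooperativeYield
        }
    }
}
```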

42.2.5 AccelComputeVTable — Compute-Specific Extensions

For devices with programmable compute (GPUs, some NPUs):

#[repr(C)]
pub struct AccelComputeVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Query supported compute APIs (Vulkan compute, OpenCL, CUDA compat).
    pub get_compute_caps: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_caps: *mut ComputeCapabilities,
    ) -> IoResultCode,

    /// Set up a shared virtual address space between CPU and device.
    /// Enables SVM (Shared Virtual Memory) / unified addressing.
    pub enable_svm: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        process_page_table: u64,    // CR3 / TTBR0 of the owning process
    ) -> IoResultCode>,

    /// Handle a device-initiated page fault.
    /// Called by the driver when the device faults on an unmapped address.
    /// The kernel resolves the fault (allocate page, migrate, map) and
    /// tells the driver to retry.
    pub handle_device_fault: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        fault_addr: u64,
        fault_flags: u32,
        out_resolution: *mut FaultResolution,
    ) -> IoResultCode>,

    /// Get performance counters for a context.
    pub get_perf_counters: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        context: AccelContextHandle,
        counters: *mut AccelPerfCounters,
    ) -> IoResultCode>,
}

/// Resolution of a device-initiated page fault, returned by handle_device_fault.
/// Tells the kernel how to proceed after a GPU/NPU fault on a shared virtual
/// memory address.
#[repr(C)]
pub struct FaultResolution {
    /// Action the driver should take.
    pub action: FaultAction,
    /// If action is MapLocal or MapRemote, the physical page to map.
    /// Ignored for other actions.
    pub page_paddr: u64,
    /// If action is MapLocal or MapRemote, the page table entry flags
    /// (readable/writable/executable, caching attributes).
    pub pte_flags: u64,
}

#[repr(u32)]
pub enum FaultAction {
    /// Retry the faulting access — kernel resolved it (e.g., allocated page).
    Retry = 0,
    /// Map the provided physical page locally (device-local memory).
    MapLocal = 1,
    /// Map the provided physical page as remote (system memory, peer device).
    MapRemote = 2,
    /// Deliver SIGBUS or SIGSEGV to the faulting context's owner process.
    Signal = 3,
    /// Kill the context (unrecoverable device error).
    KillContext = 4,
}

/// Performance counters for an accelerator context.
/// Returned by get_perf_counters() for monitoring and accounting.
#[repr(C)]
pub struct AccelPerfCounters {
    /// Total GPU/accelerator cycles executed for this context.
    pub cycles_executed: u64,
    /// Total bytes read from device memory.
    pub bytes_read: u64,
    /// Total bytes written to device memory.
    pub bytes_written: u64,
    /// Number of kernels/shaders executed.
    pub kernel_count: u64,
    /// Number of page faults (SVM/Unified Memory).
    pub page_faults: u64,
    /// Time spent stalled on memory (nanoseconds).
    pub memory_stall_ns: u64,
    /// Time spent executing (nanoseconds).
    pub execution_ns: u64,
    /// Device-specific counters (vendor-defined).
    pub vendor_counters: [u64; 8],
    pub _pad: [u8; 32],
}

#[repr(C)]
pub struct ComputeCapabilities {
    /// Shader/kernel ISA generation (vendor-specific version number).
    pub isa_version: u32,
    /// Maximum work group / thread block size.
    pub max_workgroup_size: u32,
    /// Maximum shared / local memory per work group (bytes).
    pub max_local_memory: u32,
    /// Number of shader/compute cores.
    pub shader_cores: u32,
    /// FLOPS (single precision, billions).
    pub fp32_gflops: u32,
    /// FLOPS (half precision, billions, 0 if unsupported).
    pub fp16_gflops: u32,
    /// INT8 TOPS (billions of 8-bit integer ops, for inference).
    pub int8_tops: u32,
    /// Tensor core / matrix unit count (0 if none).
    pub tensor_cores: u32,
    pub _pad: [u8; 32],
}

42.2.6 AccelDisplayVTable — Display-Capable Extensions

For devices with display output (GPUs with connectors). This vtable covers KMS mode-setting and framebuffer management; rendering/compositing acceleration is handled through AccelComputeVTable (compute shaders) or vendor-specific extensions. Advanced display features (HDR metadata, content-adaptive sync policies, display stream compression) are reserved for Phase 4 (Section 14-roadmap).

/// Display vtable for accelerators with display output.
/// Covers mode setting, framebuffer management, and hotplug.
#[repr(C)]
pub struct AccelDisplayVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Enumerate display connectors on this device.
    pub get_connectors: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_connectors: *mut AccelConnectorInfo,
        max_connectors: u32,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Get supported display modes for a connector.
    pub get_modes: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_modes: *mut AccelDisplayMode,
        max_modes: u32,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Set the active display mode for a connector.
    pub set_mode: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        mode: *const AccelDisplayMode,
    ) -> IoResultCode,

    /// Create a framebuffer object from device memory.
    pub create_framebuffer: unsafe extern "C" fn(
        ctx: *mut c_void,
        mem: AccelMemHandle,
        width: u32,
        height: u32,
        format: u32,
        out_fb: *mut AccelFramebufferHandle,
    ) -> IoResultCode,

    /// Schedule a page flip (swap front/back buffer).
    pub page_flip: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        fb: AccelFramebufferHandle,
        out_fence: *mut AccelFence,
    ) -> IoResultCode,

    /// Set display power management state.
    pub set_dpms_state: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        state: AccelDpmsState,
    ) -> IoResultCode,

    /// Get hotplug status for a connector.
    // KABI: u8 instead of bool for stable ABI (0=false, nonzero=true)
    pub get_hotplug_status: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_connected: *mut u8,
    ) -> IoResultCode,

    /// Read EDID data from a connector.
    pub read_edid: unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        out_edid: *mut u8,
        max_size: u32,
        out_size: *mut u32,
    ) -> IoResultCode,

    /// Set hardware cursor image and position.
    pub set_cursor: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        image: *const u8,
        width: u32,
        height: u32,
        hot_x: u32,
        hot_y: u32,
    ) -> IoResultCode>,

    /// Enable/disable variable refresh rate (FreeSync/G-Sync).
    pub set_vrr_enabled: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        connector_id: u32,
        enabled: u8,
    ) -> IoResultCode>,
}

/// Display connector information.
#[repr(C)]
pub struct AccelConnectorInfo {
    /// Unique connector ID within this device.
    pub connector_id: u32,
    /// Connector type: HDMI=0, DisplayPort=1, DVI=2, VGA=3, eDP=4, Virtual=5.
    pub connector_type: u16,
    /// Maximum resolution width supported (0 if unknown).
    pub max_width: u32,
    /// Maximum resolution height supported (0 if unknown).
    pub max_height: u32,
    /// Currently connected (1) or disconnected (0).
    pub connected: u8,
    /// Currently active (has a mode set).
    pub active: u8,
    pub _pad: [u8; 6],
}

/// Display mode (resolution, refresh rate, timing).
#[repr(C)]
pub struct AccelDisplayMode {
    /// Horizontal resolution in pixels.
    pub width: u32,
    /// Vertical resolution in lines.
    pub height: u32,
    /// Refresh rate in millihertz (e.g., 59950 for 59.95 Hz).
    pub refresh_millihz: u32,
    /// Clock frequency in kHz.
    pub clock_khz: u32,
    /// Horizontal front porch in pixels.
    pub hfp: u16,
    /// Horizontal sync pulse width in pixels.
    pub hsw: u16,
    /// Horizontal back porch in pixels.
    pub hbp: u16,
    /// Vertical front porch in lines.
    pub vfp: u16,
    /// Vertical sync pulse width in lines.
    pub vsw: u16,
    /// Vertical back porch in lines.
    pub vbp: u16,
    /// Preferred mode flag (monitor's preferred resolution).
    pub preferred: u8,
    pub _pad: [u8; 7],
}
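The timing fields determine the refresh rate: refresh = clock / (htotal × vtotal), where each total adds the front porch, sync width, and back porch to the active resolution. A hypothetical driver-side validation helper (not part of the KABI) that derives refresh_millihz from the other fields:

```rust
// Sanity-check sketch for AccelDisplayMode timing fields: refresh_millihz
// must equal clock / (htotal * vtotal). Struct mirrors the timing-relevant
// subset of AccelDisplayMode; the helper itself is hypothetical.

pub struct ModeTimings {
    pub width: u32,
    pub height: u32,
    pub clock_khz: u32,
    pub hfp: u16, pub hsw: u16, pub hbp: u16,
    pub vfp: u16, pub vsw: u16, pub vbp: u16,
}

/// Derive the refresh rate in millihertz from the pixel clock and blanking.
pub fn refresh_millihz(m: &ModeTimings) -> u32 {
    let htotal = m.width as u64 + m.hfp as u64 + m.hsw as u64 + m.hbp as u64;
    let vtotal = m.height as u64 + m.vfp as u64 + m.vsw as u64 + m.vbp as u64;
    // clock_khz * 1000 gives Hz; * 1000 again gives millihertz.
    ((m.clock_khz as u64 * 1_000_000) / (htotal * vtotal)) as u32
}
```

For the standard CEA 1080p60 timing (148.5 MHz clock, 2200×1125 total), this yields exactly 60000 millihertz.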

/// Framebuffer handle (opaque).
#[repr(transparent)]
pub struct AccelFramebufferHandle(pub u64);

/// Command buffer handle (opaque, driver-allocated).
/// The driver allocates command buffer memory and returns this handle to the kernel
/// for tracking. The handle is used in `PendingSubmission` to identify which command
/// buffer to execute.
#[repr(transparent)]
pub struct AccelCmdBufferHandle(pub u64);

/// Semaphore handle (opaque, driver-allocated).
/// Used for cross-context synchronization. A submission can optionally signal a
/// semaphore on completion, allowing other contexts or userspace to wait on it.
#[repr(transparent)]
pub struct AccelSemaphoreHandle(pub u64);

/// Display power management state.
#[repr(u32)]
pub enum AccelDpmsState {
    /// Display is on.
    On = 0,
    /// Display is in standby (minimal power, fast resume).
    Standby = 1,
    /// Display is suspended (low power, slower resume).
    Suspend = 2,
    /// Display is off.
    Off = 3,
}

42.2.7 Scheduler Integration

The AccelScheduler interacts with the CPU scheduler to manage threads that are waiting for GPU work completion:

  • GPU-blocked threads: Threads waiting on poll_completion or a completion callback are placed in interruptible sleep (same state as I/O wait). They do not consume CPU time while waiting.
  • Completion wakeup: When the completion notification callback fires (see register_completion_callback), the waiting thread is woken directly from the interrupt handler — no polling required.
  • CPU vruntime accounting: Time spent blocked on GPU completion is treated like I/O wait: it does not accrue CPU debt. On wakeup, the thread receives a latency bonus similar to I/O-bound tasks in CFS, ensuring responsive scheduling for interactive GPU workloads.
  • Realtime priority contexts: For AccelPriority::Realtime contexts, the completion interrupt is handled by a dedicated high-priority IRQ thread to minimize wakeup latency.

42.3 Integration with ISLE Architecture

42.3.1 Device Registry Integration

Accelerators are modeled in the device registry (Section 7) with rich sub-device structure:

pci0000:00
  +-- 0000:41:00.0 (NVIDIA A100 GPU)
      +-- Properties:
      |     device-class: "gpu"
      |     compute-units: 108
      |     local-memory-bytes: 85899345920 (80 GB HBM2e)
      |     tensor-cores: 432
      |     p2p-capable: true
      |     preemption: "instruction"
      +-- Services published:
      |     "accel-compute" (AccelComputeVTable)
      |     "rdma-target" (for GPUDirect RDMA)
      +-- Children:
          +-- partition0 (MIG 3g.40gb)  # A100-80GB supports up to 2x 3g.40gb
          +-- partition1 (MIG 3g.40gb)
          +-- nvenc0 (video encoder sub-device)
          +-- nvdec0 (video decoder sub-device)

Example: NVIDIA RTX 4090 (desktop GPU with display output):

pci0000:00
  +-- 0000:01:00.0 (NVIDIA RTX 4090)
      +-- Properties:
      |     device-class: "gpu"
      |     compute-units: 128
      |     local-memory-bytes: 25769803776 (24GB GDDR6X)
      |     tensor-cores: 512
      |     p2p-capable: false
      |     preemption: "instruction"
      +-- Services published:
      |     "accel-compute" (AccelComputeVTable)
      |     "accel-display" (AccelDisplayVTable)
      +-- Children:
          +-- nvenc0 (video encoder sub-device)
          +-- nvdec0 (video decoder sub-device)

Power Management Integration:

On system suspend, the AccelScheduler drains in-flight submissions (with a timeout matching the device registry's suspend timeout, Section 7.5.3). Pending queue entries are preserved across suspend/resume and resubmitted after the device resumes. The AccelScheduler reports idle/busy state to the device registry for runtime power management decisions. AccelPowerState maps to the registry's PowerState as follows: D0Active ↔ active, D1LowPower ↔ idle.

42.3.2 Crash Recovery

GPU driver crashes are currently catastrophic in Linux. In ISLE:

1. GPU driver (Tier 1, domain-isolated) faults.
2. ISLE Core detects the fault.
3. Device registry transitions GPU node to Recovering.
4. AccelScheduler:
   a. Marks all active contexts as "interrupted."
   b. Completes all pending submissions with error status.
   c. Notifies all processes waiting on completions.
5. Registry orchestrates crash recovery (Section 7.10):
   a. Revoke driver's isolation domain.
   b. GPU device reset (PCIe FLR or driver-specific reset).
   c. Reload driver binary.
   d. Fresh KABI vtable exchange.
6. AccelScheduler recreates contexts for processes that are still running.
   - Process-side state (command buffers, host memory) survives on the CPU; only GPU-side state is lost.
   - Processes receive an error on their in-flight submissions and can retry.
7. Total recovery time: ~100-500ms (dominated by GPU reset).

Compare Linux: full system reboot (30-60 seconds), loss of all work.

42.3.3 GPU Firmware as Cluster Member (Future)

Modern GPUs run sophisticated firmware that manages scheduling, memory, and P2P transfers. NVIDIA ships this firmware as proprietary binaries (e.g., GSP firmware); AMD's host driver is open-source, but the firmware itself is likewise distributed as signed binary blobs. Either way, this firmware is effectively a specialized OS running on the GPU's control processor.

In ISLE's distributed kernel model (Section 47), GPU firmware can potentially participate as a first-class cluster member:

Current state (Phase 1-3):

  • Host kernel controls GPU via driver (Tier 1)
  • GPU firmware is passive — it responds to host commands but does not initiate
  • All scheduling decisions, memory allocation, and work dispatch are host-driven

Future state (Phase 5-6):

  • GPU firmware implements ISLE cluster membership protocol
  • GPU registers as a cluster node with NodeId, exposes VRAM as remote-accessible memory in the device registry
  • GPU firmware participates in DSM (Distributed Shared Memory) and DLM (Distributed Lock Manager) — allows GPU-initiated RDMA transfers, GPU-to-GPU locking without CPU involvement
  • Work can be scheduled across the cluster transparently: a cgroup's workload could span CPUs on node0, GPUs on node0 and node1, and a DPU on node2

Benefits:

  • GPU-initiated transfers: GPU firmware can directly request pages via DSM without waking the CPU driver, reducing latency for fine-grained data access patterns
  • GPU-to-GPU coordination: Two GPUs on different nodes can synchronize via DLM without routing through their respective host CPUs — critical for distributed training with NCCL/RCCL
  • Heterogeneous scheduling: The kernel scheduler sees CPUs and GPUs as a unified pool of compute resources, not separate domains

Requirements:

  • GPU firmware must implement ISLE's inter-kernel messaging protocol (see Section 47.2.2 "Device-local kernels as cluster members" for detailed protocol specification)
  • Three implementation paths: (A) run full ISLE on GPU's control processor, (B) firmware shim translating ISLE messages to native GPU operations, or (C) host-side proxy driver
  • Requires firmware modifications by NVIDIA/AMD/Intel or open-source GPU firmware (e.g., Nouveau, AMDGPU firmware)
  • Security model must prevent malicious GPU firmware from compromising host integrity — cluster membership is opt-in per device and requires signed firmware verification

This is not required for ISLE to function. It is an optional future enhancement that treats the "multikernel" model (one kernel per hardware domain) as a first-class design pattern rather than a hack. See Section 47.2.2 "Device-local kernels as cluster members" for SmartNIC/DPU equivalents.

42.3.4 FMA Integration

The FMA engine (Section 39) monitors accelerator health:

// Accelerator-specific health events
HealthEventClass::Accelerator  // New class

// Health data for accelerators
#[repr(C)]
pub struct AccelHealthData {
    /// GPU temperature (millidegrees Celsius).
    pub temperature_mc: u32,
    /// Power draw (milliwatts).
    pub power_mw: u32,
    /// ECC error count (VRAM).
    pub ecc_correctable: u64,
    pub ecc_uncorrectable: u64,
    /// Thermal throttling events.
    pub throttle_count: u32,
    /// XID error code (NVIDIA) or equivalent.
    pub error_code: u32,
    /// PCIe replay count (indicates link instability).
    pub pcie_replay_count: u32,
    pub _pad: [u8; 20],
}

FMA diagnosis rules for accelerators:

| Rule | Threshold | Action |
|---|---|---|
| VRAM ECC degradation | 100 correctable / hour | Alert + schedule maintenance |
| VRAM uncorrectable error | 1 event | DisableDevice + Alert |
| Thermal throttling | 10 events / hour | Alert (may indicate cooling issue) |
| PCIe link unstable | 50 replays / minute | DemoteTier + Alert |
| Repeated driver crashes | 3 in 1 hour | DemoteTier (move to Tier 2) |
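The rate-based rules above ("N events per window") can be evaluated with a simple fixed-window event counter. A hedged sketch — the type names and the fixed-window policy are assumptions for illustration, not the ISLE FMA implementation:

```rust
// Fixed-window rate rule evaluation, as an FMA diagnosis engine might use
// for "100 correctable ECC errors / hour => Alert". Illustrative names.

pub struct RateRule {
    pub threshold: u64, // events per window that trigger the rule
    pub window_ns: u64, // window length in nanoseconds
}

pub struct RateTracker {
    window_start_ns: u64,
    count: u64,
}

impl RateTracker {
    pub fn new() -> Self {
        RateTracker { window_start_ns: 0, count: 0 }
    }

    /// Record one event at time `now_ns`; returns true if the rule fired
    /// (i.e., the event count within the current window reached threshold).
    pub fn record(&mut self, now_ns: u64, rule: &RateRule) -> bool {
        if now_ns - self.window_start_ns >= rule.window_ns {
            // Window expired: start a new one anchored at this event.
            self.window_start_ns = now_ns;
            self.count = 0;
        }
        self.count += 1;
        self.count >= rule.threshold
    }
}
```

A sliding window or token-bucket variant would avoid the boundary effects of fixed windows; the trade-off is per-event memory versus precision.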

42.3.5 Stable Tracepoints

New stable tracepoints for accelerator observability (Section 40):

| Tracepoint | Arguments | Description |
|---|---|---|
| isle_tp_stable_accel_submit | device_id, context, cmd_size, priority | Command submitted |
| isle_tp_stable_accel_complete | device_id, context, latency_ns, error | Command completed |
| isle_tp_stable_accel_preempt | device_id, preempted_ctx, preempting_ctx | Context preempted |
| isle_tp_stable_accel_migrate | device_id, direction, pages, bytes | Memory migration |
| isle_tp_stable_accel_fault | device_id, context, fault_addr | Device page fault |
| isle_tp_stable_accel_oom | device_id, requested, available | Device memory exhaustion |
| isle_tp_stable_accel_p2p | src_device, dst_device, bytes, latency_ns | P2P DMA transfer |

42.3.6 Object Namespace

Accelerators appear in the unified object namespace (Section 41):

\Devices\pci0000:00\0000:41:00.0   # GPU device object
\Accelerators\gpu0                   # Symlink to device
\Accelerators\gpu0\Contexts\        # Active execution contexts
\Accelerators\gpu0\Memory\          # Memory allocation tracking
\Accelerators\gpu0\Partitions\      # MIG partitions

Browsable via islefs:

cat /mnt/isle/Accelerators/gpu0/Memory
# type: AcceleratorMemory
# total: 40000000000 (40 GB)
# used: 27000000000 (27 GB)
# contexts:
#   ctx[pid=1234]: 16106127360 (15 GB) - training job
#   ctx[pid=5678]: 8589934592 (8 GB) - inference server
#   ctx[pid=9012]: 4294967296 (4 GB) - development

42.3.7 Partial Failure Handling

Single-context error: If a single execution context encounters an error (shader hang, invalid command), the AccelScheduler calls reset_context() on the affected context only. Other contexts on the same device continue unaffected. The affected process receives an error status on its next poll_completion call or via the completion callback.

ECC uncorrectable error: When device memory reports an uncorrectable ECC error, the affected page is retired (analogous to CPU page retirement, Section 39.6). The owning context is notified, and if possible, data is migrated to a healthy page. If the page content is unrecoverable, the owning process receives SIGBUS.

Timeout without crash: The AccelScheduler enforces max_execution_us via a kernel timer. When a submission exceeds its time limit:

  • If the device supports instruction-level preemption: preempt the overdue context, save its state, and return AccelCompletionStatus::Preempted.
  • If preemption is not supported: wait until the current command buffer boundary, then fail the overdue submission with AccelCompletionStatus::Timeout.
  • In either case, the context remains valid and the process can submit new work.

42.3.8 Open Questions

The following items require further design work:

  • Full /dev/isle-accel-N ioctl specification (number assignments, structure layouts, versioning scheme).
  • Context save/restore mechanism for mid-shader preemption (how much state must be saved, where it is stored, latency budget for save/restore).
  • Multi-GPU unified memory coherence granularity: should the coherence unit be 4KB (standard pages), 64KB (GPU-friendly), or 2MB (huge pages)? Trade-off between false sharing and transfer overhead.
  • HealthEventClass::Accelerator formal taxonomy: define the complete set of health event types, severity levels, and recommended actions for accelerator hardware.

42.4 Implementation Phasing

| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| AccelBaseVTable KABI definition | Phase 3 | Driver SDK, device registry | Define the interface first |
| Accelerator scheduler (basic) | Phase 3-4 | AccelBase, cgroups | Round-robin, then priority |
| DRM/KMS compatibility shim | Phase 3-4 | AccelBase | Required for desktop GPU |
| Heterogeneous memory (basic) | Phase 4 | Memory manager, AccelBase | CPU-device migration |
| P2P DMA | Phase 4 | IOMMU, AccelBase | Requires IOMMU support |
| Cgroup accel controller | Phase 4 | AccelScheduler, cgroups | Compute time + memory limits |
| CBS bandwidth guarantees | Phase 4-5 | AccelScheduler, CBS | Guaranteed compute time |
| Hardware preemption | Phase 5 | AccelScheduler, GPU driver | Driver-dependent |
| In-kernel inference (basic) | Phase 4 | Decision trees only | Start with page prefetching |
| In-kernel inference (neural) | Phase 5 | Quantized INT8 runtime | Requires validation |
| RDMA KABI | Phase 5 | Network drivers | Separate from GPU |
| GPUDirect RDMA | Phase 5+ | RDMA + P2P DMA | Cross-driver |
| NVIDIA KABI driver (basic init) | Phase 3-4 | AccelBase, KABI, ioctl compat | nvidia-smi + simple CUDA (Section 46.2.3) |
| NVIDIA KABI driver (UVM) | Phase 4-5 | ISLE HMM, NVIDIA basic | cudaMallocManaged + multi-GPU (Section 46.2.3) |
| NVIDIA KABI driver (production) | Phase 5 | All NVIDIA components | Full CUDA stack + crash recovery |
| Device partitioning | Phase 5+ | AccelScheduler, registry | Hardware-dependent |
| Topology-assisted collectives | Phase 5+ | Registry, RDMA, P2P | Optimization layer |

Priority Rationale

Phase 3-4 (Real Workloads): Basic accelerator framework + DRM compat + simple scheduling. This is the minimum for "GPU works on ISLE."

Phase 4-5 (Production Ready): Cgroup integration, memory management, P2P DMA, in-kernel inference basics. This is when ISLE becomes better than Linux for AI/ML workloads.

Phase 5+ (Ecosystem): Advanced scheduling, RDMA, collectives, partitioning. This is the competitive advantage phase — features that Linux fundamentally cannot provide due to architectural constraints.


42.5 Licensing Summary

| Component | IP Source | Risk |
|---|---|---|
| AccelBase KABI | Original design (vtable pattern from existing KABI) | None |
| Accelerator scheduler | CBS (academic, 1998) + priority scheduling (textbook) | None |
| Heterogeneous memory | Academic (HMM concepts, published research) | None |
| P2P DMA | PCIe spec (public), IOMMU spec (public) | None |
| DRM/KMS compat | Linux DRM (GPLv2 interface, ioctl numbers are facts) | None |
| In-kernel inference | Original design, standard ML algorithms (public) | None |
| RDMA | InfiniBand spec (public), RoCE spec (public) | None |
| Cgroup controller | Linux cgroup v2 interface (filesystem paths are facts) | None |
| NVIDIA KABI driver | New code inspired by MIT/GPLv2 open-source nvidia.ko | None (MIT-compatible, OKLF-clean) |

All vendor-specific GPU features (MIG, NVLink, etc.) are accessed through the AccelBase KABI vtable — the driver implements them, not the kernel. The kernel provides the framework; vendors fill in the hardware-specific logic through KABI.


43. Accelerator Memory and P2P DMA

43.1 Heterogeneous Memory Management

43.1.1 Problem

AI models are growing exponentially. A 70B parameter model at FP16 is ~140GB. GPU VRAM is typically 24-80GB. Models larger than VRAM need transparent memory management — pages migrating between CPU RAM and GPU VRAM based on access patterns.

Linux has HMM (Heterogeneous Memory Management) and mmu_notifiers, but they are bolted onto the existing MM and poorly integrated. Each vendor (NVIDIA UVM, AMD SVM, Intel SVM) implements its own page migration logic in the driver.

43.1.2 Design: Accelerator Memory as NUMA Nodes

The memory manager (Section 12) already has NUMA awareness — per-node buddy allocators, NUMA-local page allocation, NUMA-aware page cache placement.

Key insight: Accelerator memory is just another NUMA node. A GPU with 24GB VRAM is NUMA node N+1, with different performance characteristics (higher bandwidth, higher latency from CPU, not directly CPU-accessible).

Existing NUMA topology (Section 12):

  NUMA Node 0 (CPU 0-15)     NUMA Node 1 (CPU 16-31)
  64GB DDR5                   64GB DDR5
  Per-CPU page caches         Per-CPU page caches
  Buddy allocator 0           Buddy allocator 1

Extended with accelerator memory:

  NUMA Node 2 (GPU 0 VRAM)   NUMA Node 3 (GPU 1 VRAM)
  24GB GDDR6X                 24GB GDDR6X
  Managed by GPU driver       Managed by GPU driver
  via AccelBase vtable        via AccelBase vtable

43.1.3 Memory Node Types

// isle-core/src/mem/numa.rs (extend existing)

pub enum NumaNodeType {
    /// Standard CPU-attached DDR memory.
    CpuMemory,

    /// Accelerator device-local memory (VRAM, HBM).
    /// Managed through the AccelBase KABI vtable.
    AcceleratorMemory {
        device_id: DeviceNodeId,
        bandwidth_gbs: u32,        // GB/s. e.g., 819 for HBM3 (per-stack, 6.4 Gbps/pin),
                                    //       1229 for HBM3E (JEDEC max at 9.8 Gbps/pin;
                                    //       actual products: ~1000-1200 per stack)
        latency_ns: u32,           // e.g., 500-1000 for PCIe GPU
        cpu_visible: bool,         // Can CPU access this memory via BAR?
        cpu_visible_size: u64,     // Size of CPU-visible window (may be < total)
        coherent: bool,            // CPU-device cache coherent? (CXL = yes, PCIe = no)
    },

    /// CXL-attached memory (high-capacity, CPU-accessible, higher latency).
    CxlMemory {
        latency_ns: u32,
        /// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
        bandwidth_gbs: u32,
    },

    /// CXL 3.0 shared memory pool visible to multiple nodes.
    /// See Section 47.12 for CXL fabric integration details.
    CxlSharedPool {
        latency_ns: u32,
        /// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
        bandwidth_gbs: u32,
        /// Nodes that share this pool (bitfield).
        sharing_nodes: u64,
        /// Hardware coherence protocol version.
        coherence_version: CxlCoherenceVersion,
    },
}
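One way the memory manager can use these node attributes is to decide whether a CPU access to a page can proceed in place or forces a migration back to a CPU node. A sketch with simplified, assumed types (not the real NumaNodeType):

```rust
// Simplified node classification for the "can the CPU touch this page
// directly?" question. Type and field names are illustrative.

#[derive(Clone, Copy)]
pub enum NodeKind {
    CpuMemory,
    AcceleratorMemory { cpu_visible: bool, coherent: bool },
    CxlMemory,
}

/// Can the CPU load/store memory on this node directly, without first
/// migrating the page to a CPU node?
pub fn cpu_direct_access(kind: NodeKind) -> bool {
    match kind {
        // DDR and CXL-attached memory are in the CPU's coherent domain.
        NodeKind::CpuMemory | NodeKind::CxlMemory => true,
        // A BAR-visible VRAM window permits uncached CPU access, but without
        // hardware coherence (PCIe) actively shared pages must still be
        // migrated; only coherent, visible memory counts as directly usable.
        NodeKind::AcceleratorMemory { cpu_visible, coherent } => cpu_visible && coherent,
    }
}
```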

43.1.4 Transparent Page Migration

When a device with hw_page_faults = 1 (modern GPUs with ATS/PRI support) faults on an unmapped address, the following occurs:

1. Device encounters page fault at virtual address VA.
2. Device sends ATS (Address Translation Services) fault to IOMMU.
3. IOMMU routes fault to ISLE Core's device fault handler.
4. ISLE Core looks up VA in the owning process's VMA tree:
   a. If VA maps to a CPU-resident page:
      - Migrate the page from CPU NUMA node to device NUMA node
        (via driver's migrate_pages callback)
      - Update CPU page tables (unmap from CPU)
      - Update device page tables (map on device)
      - Resume device execution
   b. If VA is unmapped:
      - Allocate a fresh page on the device's NUMA node
      - Map in device page tables
      - Resume device execution
   c. If VA maps to another device's memory:
      - P2P migration (see Section 43)
5. Track this page's location in the unified page tracking structure.

For devices WITHOUT hardware page faults (older GPUs, most NPUs): the kernel uses software-assisted migration. Before submitting a command buffer, the scheduler identifies which pages the command buffer references and pre-migrates them. This is less efficient but functionally equivalent.
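Step 4 of the fault flow above is essentially a three-way dispatch on the state of the faulting virtual address. A simplified sketch with assumed types (the real handler operates on the process's VMA tree and the PageLocationTracker):

```rust
// Dispatch for a device-initiated page fault, mirroring cases 4a-4c of the
// flow above. VaState/FaultHandling are illustrative stand-ins for the
// kernel's internal lookup result and action types.

#[derive(Debug, PartialEq)]
pub enum VaState {
    /// VA maps to a CPU-resident page on the given NUMA node.
    CpuResident { node: u8 },
    /// VA is unmapped (first touch by the device).
    Unmapped,
    /// VA maps to memory local to another accelerator.
    OtherDevice { device_id: u32 },
}

#[derive(Debug, PartialEq)]
pub enum FaultHandling {
    /// 4a: migrate the page from CPU memory to the faulting device, then resume.
    MigrateFromCpu { src_node: u8 },
    /// 4b: allocate a fresh page on the device's NUMA node and map it.
    AllocateOnDevice,
    /// 4c: peer-to-peer migration between devices (Section 43).
    P2pMigrate { src_device: u32 },
}

pub fn resolve_device_fault(state: VaState) -> FaultHandling {
    match state {
        VaState::CpuResident { node } => FaultHandling::MigrateFromCpu { src_node: node },
        VaState::Unmapped => FaultHandling::AllocateOnDevice,
        VaState::OtherDevice { device_id } => FaultHandling::P2pMigrate { src_device: device_id },
    }
}
```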

43.1.5 Page Location Tracking

// isle-core/src/mem/hmm.rs (kernel-internal)

/// Tracks where each page of a shared address space currently resides.
pub struct PageLocationTracker {
    /// Per-page location (indexed by virtual page number).
    /// Each entry records which NUMA node currently holds this page.
    locations: RadixTree<PageLocation>,

    /// Per-page migration epoch counter (indexed by virtual page number).
    /// Incremented each time the page transitions to `Migrating` state.
    /// Used to detect stale migration completions after concurrent recovery.
    /// Each entry is a single `AtomicU64`, stored in a separate radix tree
    /// to avoid bloating the `PageLocation` enum (which is already 24 bytes).
    migration_epochs: RadixTree<AtomicU64>,

    /// Side table for in-flight migrations. Slab-allocated, bounded by
    /// `MAX_CONCURRENT_MIGRATIONS` (default 4096). The `Migrating` variant
    /// stores only the slab index (`migration_id: u32`), keeping the
    /// per-page `PageLocation` entry small.
    active_migrations: Slab<MigrationRecord>,
}

/// Full source/target metadata for an in-flight page migration.
/// Stored in `PageLocationTracker::active_migrations`, referenced by
/// `PageLocation::Migrating { migration_id }`.
pub struct MigrationRecord {
    pub source_kind: PageLocationKind,
    pub source_node: u8,
    pub source_device: DeviceNodeId,
    pub source_addr: u64,
    pub target_kind: PageLocationKind,
    pub target_node: u8,
    pub target_device: DeviceNodeId,
    pub target_addr: u64,
}

/// Stored per-page in `PageLocationTracker::locations` (one entry per virtual page).
/// `#[repr(u8)]` keeps the discriminant to one byte; the full variant data is in the
/// enum payload. The radix tree stores the enum inline, so every extra byte of
/// discriminant wastes space proportional to the mapped address-space size.
#[repr(u8)]
pub enum PageLocation {
    /// Page is in CPU memory on this NUMA node.
    CpuNode(u8),

    /// Page is in accelerator device-local memory.
    DeviceLocal {
        device_id: DeviceNodeId,
        device_addr: u64,  // Device-side physical/virtual address
    },

    /// Page is in transit (being migrated).
    /// Stores only a side-table index to avoid bloating every per-page entry.
    /// The `MigrationRecord` (source, target, addresses) is stored in
    /// `PageLocationTracker::active_migrations` — a slab of fixed capacity
    /// bounded by `MAX_CONCURRENT_MIGRATIONS`. Migrations are transient
    /// (milliseconds), so the number of in-flight entries is small relative
    /// to total pages. This keeps `PageLocation` at 24 bytes (driven by
    /// `RemoteNode` / `RemoteDevice`) instead of ~48 bytes.
    Migrating {
        migration_id: u32,
    },

    /// Page is not yet allocated (demand-paged).
    NotPresent,

    /// Page is in compressed pool (Section 13).
    Compressed,

    /// Page is in swap.
    Swapped,

    // === Distributed memory locations (see Section 47.5 for the DSM protocol) ===

    /// Page is on a remote node's CPU memory, accessible via RDMA.
    RemoteNode {
        node_id: NodeId,
        remote_phys_addr: u64,
        dsm_state: DsmPageState,
    },

    /// Page is on a remote node's accelerator memory (GPUDirect RDMA).
    RemoteDevice {
        node_id: NodeId,
        device_id: DeviceNodeId,
        device_addr: u64,
    },

    /// Page is in CXL-attached memory pool (hardware-coherent).
    /// See Section 47.12 for CXL fabric integration details.
    CxlPool {
        pool_id: u32,
        pool_offset: u64,
    },
}
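The 24-byte size claim can be checked with a simplified stand-in. This sketch assumes `NodeId` and `DeviceNodeId` are 32-bit IDs and `DsmPageState` is a one-byte state; only the size-dominant variants are reproduced:

```rust
// Hedged sketch verifying the 24-byte claim for `PageLocation` under
// `#[repr(u8)]`. Type widths are assumptions for illustration.

type NodeId = u32;
type DeviceNodeId = u32;

#[repr(u8)]
#[allow(dead_code)]
enum DsmPageState { Shared = 0, Exclusive = 1 }

#[repr(u8)]
#[allow(dead_code)]
enum PageLocation {
    CpuNode(u8),
    DeviceLocal { device_id: DeviceNodeId, device_addr: u64 },
    Migrating { migration_id: u32 },
    NotPresent,
    RemoteNode { node_id: NodeId, remote_phys_addr: u64, dsm_state: DsmPageState },
    RemoteDevice { node_id: NodeId, device_id: DeviceNodeId, device_addr: u64 },
}

fn main() {
    // The widest variants (RemoteNode / RemoteDevice) drive the size to 24
    // bytes; Migrating stays small because it holds only a u32 slab index.
    assert_eq!(core::mem::size_of::<PageLocation>(), 24);
}
```

This illustrates the design note above: embedding a full `MigrationRecord` inline in `Migrating` would roughly double the per-page footprint.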

/// Discriminant for `PageLocation` variants, used in `MigrationRecord`
/// to record the source/target kind of an in-flight migration.
#[repr(u8)]
pub enum PageLocationKind {
    CpuNode      = 0,
    DeviceLocal  = 1,
    NotPresent   = 2,
    Compressed   = 3,
    Swapped      = 4,
    RemoteNode   = 5,
    RemoteDevice = 6,
    CxlPool      = 7,
}

43.1.6 Migration Policy

// isle-core/src/mem/hmm.rs

pub struct MigrationPolicy {
    /// Migrate on first device fault (aggressive, more migration traffic
    /// but lower steady-state fault rate).
    pub migrate_on_first_fault: bool,

    /// Migrate only after N faults on the same page within a window
    /// (conservative, less migration traffic, higher fault rate).
    pub fault_threshold: u32,
    pub fault_window_ms: u32,

    /// Batch migration: when migrating, also prefetch nearby pages
    /// (spatial locality heuristic).
    pub prefetch_pages: u32,    // 0 = no prefetch, 16 = prefetch 16 neighbors

    /// Maximum pages in-flight (being migrated) at once.
    pub max_inflight_migrations: u32,

    /// Eviction policy when device memory is full:
    /// LRU (least recently accessed on device) is the default.
    pub eviction_policy: EvictionPolicy,
}

pub enum EvictionPolicy {
    /// Evict least recently accessed pages back to CPU memory.
    Lru,
    /// Evict pages with lowest access frequency.
    Lfu,
    /// Use learned policy (in-kernel inference, Section 45).
    Learned,
}
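The fault-threshold policy can be sketched as a small decision function. Field names follow `MigrationPolicy` above; `should_migrate` and the timestamp representation are assumptions for illustration:

```rust
// Illustrative sketch of the fault-threshold migration decision: migrate on
// the first fault when `migrate_on_first_fault` is set, otherwise only after
// `fault_threshold` faults inside `fault_window_ms`.

struct MigrationPolicy {
    migrate_on_first_fault: bool,
    fault_threshold: u32,
    fault_window_ms: u32,
}

/// `fault_times_ms` holds timestamps of faults on this page, oldest first.
fn should_migrate(policy: &MigrationPolicy, fault_times_ms: &[u64], now_ms: u64) -> bool {
    if policy.migrate_on_first_fault {
        return true;
    }
    let window_start = now_ms.saturating_sub(policy.fault_window_ms as u64);
    let recent = fault_times_ms.iter().filter(|&&t| t >= window_start).count() as u32;
    recent >= policy.fault_threshold
}

fn main() {
    let conservative = MigrationPolicy {
        migrate_on_first_fault: false,
        fault_threshold: 3,
        fault_window_ms: 100,
    };
    // Two faults in the window: stay put. A third inside the window: migrate.
    assert!(!should_migrate(&conservative, &[950, 980], 1000));
    assert!(should_migrate(&conservative, &[950, 980, 995], 1000));
    // Faults that fell out of the window do not count toward the threshold.
    assert!(!should_migrate(&conservative, &[800, 950, 980], 1000));
}
```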

43.1.7 Memory Oversubscription

When a model is larger than device VRAM:

Total model: 140GB (70B params, FP16)
GPU VRAM: 24GB
CPU RAM available: 256GB

Strategy:
  1. Most-accessed layers (attention heads, active KV cache) → GPU VRAM
  2. Less-accessed layers (early embedding, output projection) → CPU RAM
  3. Migration on demand: when GPU needs a page in CPU RAM, migrate it
     and evict a cold page from VRAM back to CPU RAM

The kernel manages this transparently. The ML framework sees a unified
140GB address space. Page migration is handled by the kernel's device
fault handler, exactly as CPU demand paging handles a virtual address space
larger than physical RAM.

This is the accelerator equivalent of virtual memory. The kernel has been managing memory oversubscription since the 1960s. Now it does it for GPU VRAM too.

43.1.8 Huge Page Support for Tensors

ML tensors benefit from huge pages (fewer TLB misses, better DMA performance):

/// Page size selection for accelerator memory allocations.
/// This is a separate enum (not bitflags) because page sizes are mutually
/// exclusive — a single allocation uses exactly one page size.
#[repr(u32)]
pub enum AccelPageSize {
    /// Standard 4KB pages.
    Standard     = 0,
    /// 2MB huge pages.
    HugePage2M   = 1,
    /// 1GB huge pages.
    HugePage1G   = 2,
    /// Device-optimal page size (device driver selects the best size).
    DeviceOptimal = 3,
}

/// Accelerator memory allocation modifier flags (OR-combinable).
/// Combined with a separate AccelPageSize field to fully describe
/// an allocation request.
bitflags::bitflags! {
    #[repr(transparent)]
    pub struct AccelMemFlags: u32 {
        /// Non-migratable: pin in device memory, never evict to CPU.
        const PINNED          = 0x100;
        /// CPU-visible: map into CPU address space via BAR.
        const CPU_VISIBLE     = 0x200;
        /// Coherent: CPU-device coherent (requires CXL or similar).
        const COHERENT        = 0x400;
    }
}
// Usage: page_size = AccelPageSize::HugePage2M,
//        flags = AccelMemFlags::PINNED | AccelMemFlags::CPU_VISIBLE

43.1.9 Multi-Device Memory Coherence

When multiple GPUs access the same virtual address range (e.g., tensor parallelism across GPUs on the same machine), the kernel must manage coherence between device memories.

The PageLocation enum already includes DeviceLocal for single-device pages. For read-only copies replicated across multiple GPUs, the kernel uses a single-owner coherence protocol with explicit synchronization:

  • Single-owner with read sharing: At any time, one device owns the writable copy. Other devices may hold read-only copies. This mirrors the DSM protocol in Section 47.5.

  • Per-page ownership lock and wait queue: Each shared page has a per-page spinlock in the PageLocationTracker that serializes ownership transitions, plus a per-page wait queue for faulters that arrive during a migration. The spinlock protects only the page metadata (state, owner, reader list) — it is held for tens of nanoseconds, never across I/O. When a device faults on a page, the fault handler acquires this lock to inspect or modify the page's PageLocation state. If the page is in Migrating state, the faulter releases the spinlock and blocks on the per-page wait queue (not spinning) until the migration completes and wakes all waiters. The lock is per-page (not per-device or global) to avoid contention on unrelated pages.

  • Explicit ownership transfer protocol: When a device writes to a page owned by another device:

  1. Fault handler acquires the per-page ownership spinlock.
  2. A MigrationRecord (source and target) is allocated in the side table and the page state transitions to PageLocation::Migrating { migration_id }. The page's migration_epoch counter (in PageLocationTracker::migration_epochs) is incremented. The migrator captures this epoch value before releasing the spinlock.
  3. Per-page ownership spinlock is released. The Migrating state prevents other faulters from modifying the page — they will see Migrating and block on the per-page wait queue until the migration completes.
  3a. Pre-revocation fence: Before revoking the source mapping (step 4), the kernel issues a device-specific quiesce command via preempt_context() on the source device's active context. This ensures the source device has completed all in-flight writes to the page before the DMA copy reads the data. The fence is verified via AccelFence completion before proceeding to step 5.
  4. The owning device's mapping is revoked (device page table unmap via driver callback). This is a slow operation (microseconds) and runs outside the spinlock.
  5. P2P DMA transfers the page data to the new owner (Section 43.2). This may take tens to hundreds of microseconds and also runs outside the spinlock.
  6. All devices holding read-only copies are invalidated via the AccelScheduler (which tracks all active contexts for a given address space). Each invalidation unmaps the page from the device's page table via the driver's callback.
  7. Fault handler re-acquires the per-page ownership spinlock. Stale migration detection: the migrator checks that the page's migration_epoch still matches the epoch value captured at step 2. If the epoch has changed (because a crash recovery or concurrent migration reset the page state), the migration is stale — the migrator discards the DMA result and returns EAGAIN to trigger a fresh fault. This prevents stale migration completions from corrupting page state after concurrent recovery paths have already resolved the page.
  8. The new owner's mapping is installed and page state transitions to DeviceLocal.
  9. Per-page ownership spinlock is released.
  10. All waiters on the per-page wait queue are woken. They re-fault and see the new DeviceLocal state.

Steps 4-6 (revocation, DMA transfer, and reader invalidation) run outside the spinlock, so they do not block other faulters on unrelated pages or cause spin-time proportional to I/O latency. Faulters that arrive during steps 4-6 see Migrating state and sleep on the wait queue instead of spinning. Steps 4-6 must complete before step 8 (new mapping). This is a write-invalidate protocol — the same model used by CPU cache coherence (MOESI) and the DSM protocol in Section 47.5, adapted for device memory.

migration_epoch counter: The PageLocationTracker maintains a per-page migration_epoch counter (an AtomicU64 in its migration_epochs radix tree). The counter is incremented at step 2 each time a page transitions to Migrating. The migrator captures the epoch value after the increment and before releasing the spinlock. At step 7, after re-acquiring the spinlock, the migrator compares the current migration_epoch against the captured value. A mismatch indicates that a device crash + recovery or a concurrent migration has intervened during the lock-free DMA window and has already resolved (or restarted) the page migration independently. In that case the DMA result is discarded and EAGAIN is returned; the faulting device will re-fault and enter a fresh migration sequence.
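The epoch check reduces to a capture-then-compare on a single atomic. A minimal sketch, where `PageMeta`, the helper names, and the error-code constant are illustrative rather than the ISLE KABI:

```rust
// Minimal sketch of the migration_epoch stale-completion check for one page.

use std::sync::atomic::{AtomicU64, Ordering};

const EAGAIN: i32 = -11; // illustrative error code

struct PageMeta {
    migration_epoch: AtomicU64,
}

/// Step 2: transition to Migrating and capture the post-increment epoch.
fn begin_migration(page: &PageMeta) -> u64 {
    page.migration_epoch.fetch_add(1, Ordering::SeqCst) + 1
}

/// Step 7: after the lock-free DMA window, accept the result only if no
/// concurrent recovery or migration bumped the epoch in the meantime.
fn complete_migration(page: &PageMeta, captured_epoch: u64) -> Result<(), i32> {
    if page.migration_epoch.load(Ordering::SeqCst) == captured_epoch {
        Ok(())
    } else {
        Err(EAGAIN) // stale: discard the DMA result, caller re-faults
    }
}

fn main() {
    let page = PageMeta { migration_epoch: AtomicU64::new(0) };

    // Uncontended migration completes normally.
    let epoch = begin_migration(&page);
    assert!(complete_migration(&page, epoch).is_ok());

    // A recovery path that restarts the migration bumps the epoch, so the
    // original (now stale) completion is rejected with EAGAIN.
    let stale = begin_migration(&page);
    let _recovery = begin_migration(&page);
    assert_eq!(complete_migration(&page, stale), Err(EAGAIN));
}
```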

  • Migration error and timeout handling: Migration is a multi-step I/O operation (device page table manipulation + DMA transfer) that can fail at any point. Without explicit recovery, a page can be stuck permanently in Migrating state, blocking all faulters on its wait queue indefinitely.

Timeouts: Every migration has a deadline: 100ms for intra-node transfers (devices on the same PCIe root complex or NVLink fabric), 1s for cross-node transfers (remote devices via RDMA or CXL). The fault handler starts a kernel timer when it transitions the page to Migrating (step 2 above). If the timer fires before the migration completes (step 8), the timeout handler initiates rollback.

Migration result:

```rust
/// Outcome of a page migration attempt.
#[repr(u8)]
pub enum MigrationResult {
    /// Migration completed successfully. Page is now owned by the target device.
    Success,

    /// Migration failed (DMA timeout, device error, IOMMU fault). Page has been
    /// rolled back to its original location on the source device. Waiters are
    /// woken to re-fault and will find the page at its original location.
    RollbackToSource,

    /// Source device crashed or was reset during migration. The page data in
    /// device memory is lost. Page is marked `NotPresent` so the next fault
    /// reloads from backing store (CPU memory copy or disk).
    SourceLost,
}
```

Failure recovery by case:

  1. DMA timeout or destination device error (most common — e.g., P2P DMA times out, destination device returns an error during page write):

    • Cancel or abandon the in-flight DMA transfer.
    • The source device still holds the original page data (it was not unmapped until the transfer completes).
    • Re-acquire the per-page ownership spinlock.
    • Transition page state back from Migrating to its original DeviceLocal or CpuNode state (the MigrationRecord in the side table, looked up via the Migrating variant's migration_id, records exactly where to roll back to).
    • Release the per-page ownership spinlock.
    • Wake all waiters on the per-page wait queue. They re-fault and find the page at its original location.
    • Increment migration_failure_count in the PageLocationTracker stats.
    • Result: MigrationResult::RollbackToSource.
  2. Source device crash during migration (rare — the source device is reset or its driver crashes while the page is in Migrating state):

    • The page data in device memory is physically lost (device reset destroys VRAM contents — see Section 42.3.2).
    • The DMA transfer (if in progress) is aborted by the device reset.
    • Re-acquire the per-page ownership spinlock.
    • Transition page state to PageLocation::NotPresent.
    • Release the per-page ownership spinlock.
    • Wake all waiters on the per-page wait queue with an error indication. Waiters re-fault and trigger a fresh page-in from backing store: if the page has a CPU memory copy (e.g., it was migrated to the device from CPU RAM and the CPU copy was preserved or the page is file-backed), the fault handler loads from the backing store. If no backing store exists (anonymous page that was only in device memory), the owning process receives SIGBUS.
    • Increment migration_failure_count and source_lost_count in stats.
    • Result: MigrationResult::SourceLost.
  3. Waiter notification on failure: Waiters on the per-page wait queue are woken with an error code (MigrationFailed or SourceDeviceLost) embedded in the wait queue wake event. On wakeup, each waiter re-enters the fault handler, re-acquires the per-page spinlock, and inspects the current page state. For RollbackToSource, the page is back at its original location and the waiter proceeds normally (read-share or initiate a fresh migration attempt). For SourceLost, the page is NotPresent and the waiter triggers a fresh fault resolution (page-in from backing store or SIGBUS).

  4. Repeated migration failures: If a page accumulates more than 3 migration failures within a 60-second window (tracked per-page in PageLocationTracker), the page is marked as non-migratable (pinned at its current location) for a cooldown period (5 minutes). This prevents pathological retry loops on pages that consistently fail to migrate (e.g., due to a flaky P2P DMA path).

  5. Read-sharing acquisition: When a device reads a page owned by another device, the fault handler acquires the per-page spinlock and checks the page state. If the page is in Migrating state, the faulter releases the spinlock and blocks on the per-page wait queue (same as for write faults). Otherwise, the faulter records the additional reader in the page metadata, releases the spinlock, then performs the P2P DMA copy to create a read-only replica on the requesting device (outside the spinlock). No ownership change occurs.

  6. Distributed cluster coherence: For multi-device coherence across machines (not just local GPUs), the per-page spinlock is replaced by the DSM directory protocol (Section 47.5, 12-distributed.md). The DSM protocol already provides single-owner / multi-reader semantics with a home-node directory that serializes ownership transitions across network boundaries — the same semantics as the local per-page spinlock, but designed for cross-node operation. Local multi-GPU coherence uses the lightweight per-page spinlock and wait queue; cross-machine coherence uses the DSM directory. The DLM (Section 31a) is for filesystem-level distributed locking, not page-level coherence. The write-invalidate coherence protocol is identical in both cases — only the ownership serialization mechanism differs (local spinlock vs. DSM directory).
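The repeated-failure cooldown (case 4 above) is simple windowed counting. A sketch under the stated thresholds (more than 3 failures in 60 s pins the page for 5 minutes); the structure names are illustrative:

```rust
// Sketch of per-page migration-failure tracking with a cooldown pin.

const FAILURE_LIMIT: usize = 3;        // failures tolerated per window
const FAILURE_WINDOW_MS: u64 = 60_000; // 60-second window
const COOLDOWN_MS: u64 = 300_000;      // 5-minute non-migratable cooldown

#[derive(Default)]
struct PageFailureState {
    failure_times_ms: Vec<u64>,
    pinned_until_ms: u64,
}

impl PageFailureState {
    fn record_failure(&mut self, now_ms: u64) {
        // Drop failures that have aged out of the window.
        self.failure_times_ms
            .retain(|&t| now_ms.saturating_sub(t) < FAILURE_WINDOW_MS);
        self.failure_times_ms.push(now_ms);
        if self.failure_times_ms.len() > FAILURE_LIMIT {
            self.pinned_until_ms = now_ms + COOLDOWN_MS;
            self.failure_times_ms.clear();
        }
    }

    fn migratable(&self, now_ms: u64) -> bool {
        now_ms >= self.pinned_until_ms
    }
}

fn main() {
    let mut page = PageFailureState::default();
    for t in [1_000, 2_000, 3_000] {
        page.record_failure(t);
    }
    assert!(page.migratable(4_000)); // 3 failures: still migratable
    page.record_failure(4_000);      // 4th failure inside the 60 s window
    assert!(!page.migratable(5_000));              // pinned during cooldown
    assert!(page.migratable(4_000 + COOLDOWN_MS)); // cooldown expired
}
```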

Huge page requirement for multi-GPU coherence: The per-page coherence tracking structures (GpuCoherenceEntry, ~64 bytes per page) mandate huge pages (2 MB) when tracking >4 GPUs simultaneously. For a 4-GPU system with 256 GB VRAM total (64 GB per GPU, 16M pages per GPU), the coherence table requires ~1 GB. Using 4 KB pages for this allocation would consume 256K page table entries and cause TLB thrashing during coherence lookups. The allocator automatically promotes coherence table allocations to 2 MB huge pages when gpu_count >= 4. On systems without huge page support, the coherence table is capped at 4 GPUs; attempting to register a 5th GPU returns -ENOMEM.

For the common case of gradient all-reduce (all GPUs read the same weights but write different gradients), the kernel keeps weights as shared-read on all devices and only migrates gradient pages on write. Since each GPU writes to its own gradient partition (non-overlapping address ranges), per-page locks are uncontended in the common case.


43.2 Peer-to-Peer DMA

43.2.1 Problem

AI workloads need data to flow between devices without CPU involvement:

  • Storage → GPU: Load training data directly from NVMe to GPU VRAM (GPUDirect Storage)
  • Network → GPU: Receive model weights from remote node directly to GPU (RDMA + GPUDirect)
  • GPU → GPU: Gradient exchange between GPUs in multi-GPU training (NVLink, xGMI, PCIe P2P)
  • GPU → Network: Send inference results directly from GPU to network (bypassing CPU copy)

Linux supports these as vendor-specific driver features. There is no general kernel mechanism.

43.2.2 Design: Generalized P2P DMA

Extend the KABI with a general mechanism for device-to-device DMA transfers:

// Appended to KernelServicesVTable

/// Set up a peer-to-peer DMA mapping between two devices.
/// The kernel programs the IOMMU to allow device A to DMA to device B's
/// memory region.
pub p2p_dma_map: Option<unsafe extern "C" fn(
    src_device: DeviceHandle,
    dst_device: DeviceHandle,
    dst_mem: AccelMemHandle,    // Or DmaBufferHandle for non-accel devices
    dst_offset: u64,
    size: u64,
    out_iova: *mut u64,         // IOVA that src_device can use to reach dst memory
    out_mapping: *mut P2pMappingHandle,
) -> IoResultCode>,

/// Tear down a P2P DMA mapping.
pub p2p_dma_unmap: Option<unsafe extern "C" fn(
    mapping: P2pMappingHandle,
) -> IoResultCode>,

/// Initiate a DMA transfer between two devices.
/// The kernel coordinates between the two drivers.
pub p2p_dma_transfer: Option<unsafe extern "C" fn(
    mapping: P2pMappingHandle,
    src_offset: u64,
    dst_offset: u64,
    size: u64,
    out_fence: *mut AccelFence,
) -> IoResultCode>,
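The intended call ordering — map, transfer within mapping bounds, unmap — can be shown against a stand-in. The real entries are `unsafe extern "C"` pointers; this mock uses safe methods purely to illustrate sequencing and bounds checking, and all names besides the vtable entry points are assumptions:

```rust
// Hedged sketch of the p2p_dma_map → p2p_dma_transfer → p2p_dma_unmap
// sequence against a mock backend.

type IoResultCode = i32;
const OK: IoResultCode = 0;
const EINVAL: IoResultCode = -22;

struct MockP2p {
    mapped: bool,
}

impl MockP2p {
    fn p2p_dma_map(&mut self, size: u64, out_iova: &mut u64) -> IoResultCode {
        if size == 0 {
            return EINVAL;
        }
        self.mapped = true;
        *out_iova = 0xF000_0000; // IOVA the source device uses to reach dst memory
        OK
    }

    fn p2p_dma_transfer(&self, offset: u64, len: u64, map_size: u64) -> IoResultCode {
        // The target region must lie within the bounds of the P2P mapping.
        if !self.mapped || offset.checked_add(len).map_or(true, |end| end > map_size) {
            return EINVAL;
        }
        OK
    }

    fn p2p_dma_unmap(&mut self) -> IoResultCode {
        self.mapped = false;
        OK
    }
}

fn main() {
    let mut p2p = MockP2p { mapped: false };
    let mut iova = 0u64;
    assert_eq!(p2p.p2p_dma_map(1 << 20, &mut iova), OK);
    assert_eq!(p2p.p2p_dma_transfer(0, 4096, 1 << 20), OK);
    assert_eq!(p2p.p2p_dma_transfer(1 << 20, 1, 1 << 20), EINVAL); // out of bounds
    assert_eq!(p2p.p2p_dma_unmap(), OK);
}
```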

43.2.3 IOMMU Integration

P2P DMA requires careful IOMMU programming:

Without P2P:
  Device A → IOMMU → CPU RAM ← IOMMU ← Device B
  (two DMA transfers, CPU RAM as bounce buffer)

With P2P:
  Device A → IOMMU → Device B's BAR / memory
  (one DMA transfer, no CPU involvement)

The kernel validates:

  1. Both devices support P2P (PCIe ACS, correct topology)
  2. The IOMMU can map across devices (same IOMMU domain, or the IOMMU supports cross-device mapping)
  3. The requesting driver has the P2P_DMA capability
  4. The target memory region is within the bounds of the P2P mapping

If hardware P2P is not possible (devices behind different root complexes without P2P bridge support), the kernel falls back to CPU-mediated copy transparently. The driver API is the same either way.

43.2.4 Topology-Aware Placement

The device registry (Section 7) provides topology information. The kernel uses this for P2P path selection:

PCIe Topology:
  Root Complex 0
    +-- Root Port 0
    |   +-- GPU 0 (VRAM: 24GB)
    |   +-- NVMe 0
    +-- Root Port 1
    |   +-- GPU 1 (VRAM: 24GB)
    |   +-- NIC (100GbE)

P2P capability matrix:
  GPU 0 ↔ NVMe 0: Direct P2P (same root port) — optimal
  GPU 0 ↔ GPU 1:  P2P via root complex — good
  GPU 0 ↔ NIC:    P2P via root complex — good
  GPU 1 ↔ NVMe 0: P2P via root complex — good

The scheduler uses this topology when placing workloads: a training job that loads data from NVMe 0 is preferentially scheduled on GPU 0 (same root port = best P2P path).
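The capability matrix above amounts to a path-cost ranking. A sketch, where `PcieLocation`, `p2p_path_cost`, and the numeric costs are assumptions for illustration:

```rust
// Illustrative topology-aware P2P path ranking: same root port beats same
// root complex, which beats a CPU-mediated fallback.

#[derive(Clone, Copy)]
struct PcieLocation {
    root_complex: u8,
    root_port: u8,
}

/// Lower cost = better P2P path. Costs are arbitrary rank values.
fn p2p_path_cost(a: PcieLocation, b: PcieLocation) -> u32 {
    if a.root_complex != b.root_complex {
        100 // no direct P2P path: CPU-mediated copy fallback
    } else if a.root_port == b.root_port {
        1 // direct P2P under the same root port (optimal)
    } else {
        10 // P2P routed through the root complex (good)
    }
}

fn main() {
    let gpu0 = PcieLocation { root_complex: 0, root_port: 0 };
    let nvme0 = PcieLocation { root_complex: 0, root_port: 0 };
    let gpu1 = PcieLocation { root_complex: 0, root_port: 1 };

    // Matches the matrix above: GPU 0 ↔ NVMe 0 outranks GPU 0 ↔ GPU 1.
    assert!(p2p_path_cost(gpu0, nvme0) < p2p_path_cost(gpu0, gpu1));
    assert_eq!(p2p_path_cost(gpu0, nvme0), 1);
}
```

A placement policy can then prefer the device pair with the lowest cost, which is how the NVMe-0-to-GPU-0 preference above falls out.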

43.2.5 P2P Access Control

Peer-to-peer DMA allows devices to directly access each other's memory without CPU involvement. Without proper access control, a malicious or compromised device could read or corrupt another device's memory. This section specifies the mandatory access control mechanisms for all P2P DMA operations.

Authorization Model

Every P2P DMA mapping requires explicit authorization from both the source and target device owners:

P2P DMA Authorization Flow:
  1. Client (process or driver) requests P2P mapping: src_device → dst_device memory
  2. Kernel checks:
     a. Client holds ACCEL_P2P capability (0x0102) for BOTH devices
     b. Both devices are in the client's allowed device set (cgroup accel.devices)
     c. Target memory region is within the client's allocated AccelMemHandle
     d. Source device's cgroup has not exceeded accel.p2p.max limit (if set)
  3. Kernel programs IOMMU with P2P mapping (never direct device-to-device without IOMMU)
  4. P2pMappingHandle is bound to the requesting context — not transferable
  5. Mapping is recorded in both devices' P2P ACL tables

Capability-Based Device ACLs

Each device maintains a P2P Access Control List (ACL) that specifies which other devices are authorized to initiate P2P DMA to its memory:

/// P2P ACL entry — one per authorized (src_device, dst_device) pair.
/// Stored in the device registry node for the target device.
#[repr(C)]
pub struct P2pAclEntry {
    /// Source device that is authorized to initiate P2P DMA.
    pub src_device_id: DeviceNodeId,
    /// Capability token that authorized this ACL entry.
    /// Must be valid for the ACL entry to be valid.
    pub authorizing_cap: CapabilityToken,
    /// Maximum bytes that can be transferred via this ACL entry.
    /// Optional: 0 means unlimited (subject to cgroup limits).
    pub max_bytes: u64,
    /// Bytes transferred so far (reset on ACL refresh or revocation).
    pub bytes_transferred: AtomicU64,
    /// Creation timestamp (for auditing and expiration).
    pub created_ns: u64,
    /// Expiration timestamp (0 = no expiration).
    pub expires_ns: u64,
}

The ACL is consulted on every p2p_dma_map call. The mapping is denied if:

  • No ACL entry exists for (src_device, dst_device)
  • The authorizing capability has been revoked
  • The byte limit has been exceeded
  • The ACL entry has expired
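These denial conditions can be sketched as a single check over the entry. Field names follow `P2pAclEntry`; the `authorizing_cap_revoked` flag and clock argument are illustrative stand-ins for capability-revocation lookup and the kernel clock:

```rust
// Sketch of the per-mapping ACL check performed on p2p_dma_map.

struct P2pAclEntry {
    max_bytes: u64,                // 0 = unlimited (subject to cgroup limits)
    bytes_transferred: u64,
    expires_ns: u64,               // 0 = no expiration
    authorizing_cap_revoked: bool, // stand-in for capability-token lookup
}

enum AclDenial { CapRevoked, Expired, ByteLimit }

fn check_acl(entry: &P2pAclEntry, request_bytes: u64, now_ns: u64) -> Result<(), AclDenial> {
    if entry.authorizing_cap_revoked {
        return Err(AclDenial::CapRevoked);
    }
    if entry.expires_ns != 0 && now_ns >= entry.expires_ns {
        return Err(AclDenial::Expired);
    }
    if entry.max_bytes != 0 && entry.bytes_transferred + request_bytes > entry.max_bytes {
        return Err(AclDenial::ByteLimit);
    }
    Ok(())
}

fn main() {
    let entry = P2pAclEntry {
        max_bytes: 1 << 20,             // 1 MB cap
        bytes_transferred: 900 * 1024,  // 900 KB already used
        expires_ns: 0,
        authorizing_cap_revoked: false,
    };
    assert!(check_acl(&entry, 100 * 1024, 0).is_ok()); // fits under the cap
    assert!(matches!(check_acl(&entry, 200 * 1024, 0), Err(AclDenial::ByteLimit)));
}
```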

IOMMU Mediation

All P2P DMA transactions are mediated by the IOMMU. Direct device-to-device DMA without IOMMU translation is never permitted:

Security-Critical Design Decision:
  P2P DMA Path:  Device A → IOMMU → Device B's memory
  NOT:           Device A → Device B (direct PCIe P2P without IOMMU)

Rationale:
  1. IOMMU provides address translation and access validation on every transaction
  2. IOMMU can revoke access instantly by invalidating the IOVA mapping
  3. IOMMU provides fault isolation if a device goes rogue
  4. IOMMU enables per-transaction logging for security auditing

The IOMMU is programmed with per-device IOVA page tables. A P2P mapping creates an IOVA entry in the source device's page table that translates to the target device's physical BAR address. When the mapping is revoked, the IOVA entry is invalidated.

Revocation Protocol

P2P access must be revoked when:

  • The authorizing capability is revoked (process exit, cgroup removal)
  • A device is removed from the system (hot-unplug)
  • A device is quarantined due to an error or security event (Section 42.3.2)
  • The cgroup limit is exceeded
  • An explicit unmap request is made

Revocation is synchronous and blocking:

P2P Revocation Sequence:
  1. Kernel receives revocation trigger (capability revoke, device quarantine, etc.)
  2. Kernel looks up all P2P mappings involving the affected device(s)
  3. For each active mapping:
     a. Set mapping state to REVOKING
     b. Issue IOMMU TLB invalidation for the IOVA range
     c. Wait for in-flight transactions to complete (see below)
     d. Remove the mapping from both devices' ACL tables
     e. Free the P2pMappingHandle
  4. Signal completion to revocation requester

In-Flight Transaction Handling

The critical challenge is handling P2P DMA transactions that are in-flight when revocation is requested. The kernel uses a quiesce-then-revoke protocol:

/// P2P mapping state machine.
#[repr(u8)]
pub enum P2pMappingState {
    /// Mapping is active and can be used for new transfers.
    Active = 0,
    /// Mapping is being revoked. No new transfers allowed.
    /// In-flight transactions are completing.
    Revoking = 1,
    /// Mapping is fully revoked. IOVA is invalid.
    Revoked = 2,
}

/// In-flight transaction tracking.
/// Each P2P mapping tracks pending transfers.
pub struct P2pMapping {
    pub handle: P2pMappingHandle,
    pub src_device: DeviceNodeId,
    pub dst_device: DeviceNodeId,
    pub state: AtomicU8,  // P2pMappingState
    pub iova_base: u64,
    pub iova_size: u64,
    /// Number of in-flight transfers using this mapping.
    /// Incremented before p2p_dma_transfer, decremented after fence completion.
    pub in_flight_count: AtomicU32,
    /// Wait queue for revocation to wait on in-flight completion.
    pub revoke_wait_queue: WaitQueue,
}

Revocation waits for in-flight transactions:

In-Flight Handling During Revocation:
  1. Set state to REVOKING (atomic store)
  2. New p2p_dma_transfer calls fail with EREVOKED
  3. Wait for in_flight_count to reach zero:
     - Each transfer increments in_flight_count before submission
     - Each transfer decrements after fence signals completion
     - Timeout: 100ms for local P2P, 1s for cross-node RDMA
  4. If timeout expires:
     a. Issue device quiesce via driver callback (preempt_context)
     b. Force IOMMU TLB invalidation
     c. Log warning: "P2P revocation timed out, forced quiesce"
     d. Proceed with revocation (data may be lost or corrupted)
  5. Set state to REVOKED
  6. Free mapping resources
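The quiesce-then-revoke steps above reduce to a small state machine over `state` and `in_flight_count`. A single-threaded sketch; the error-code constant is illustrative, and the wait-queue/timeout handling is compressed into a boolean return:

```rust
// Sketch of the P2P revocation state machine: REVOKING rejects new
// transfers, and revocation completes only once in-flight transfers drain.

use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const ACTIVE: u8 = 0;
const REVOKING: u8 = 1;
const REVOKED: u8 = 2;
const EREVOKED: i32 = -95; // illustrative error code

struct P2pMapping {
    state: AtomicU8,
    in_flight_count: AtomicU32,
}

impl P2pMapping {
    fn begin_transfer(&self) -> Result<(), i32> {
        if self.state.load(Ordering::SeqCst) != ACTIVE {
            return Err(EREVOKED); // step 2: new transfers fail once REVOKING
        }
        self.in_flight_count.fetch_add(1, Ordering::SeqCst);
        Ok(())
    }

    fn complete_transfer(&self) {
        self.in_flight_count.fetch_sub(1, Ordering::SeqCst);
    }

    /// Returns true once all in-flight transfers have drained.
    fn try_revoke(&self) -> bool {
        self.state.store(REVOKING, Ordering::SeqCst);
        if self.in_flight_count.load(Ordering::SeqCst) == 0 {
            self.state.store(REVOKED, Ordering::SeqCst);
            true
        } else {
            false // caller sleeps on revoke_wait_queue and arms the timeout
        }
    }
}

fn main() {
    let m = P2pMapping { state: AtomicU8::new(ACTIVE), in_flight_count: AtomicU32::new(0) };
    m.begin_transfer().unwrap();
    assert!(!m.try_revoke());                      // one transfer still in flight
    assert_eq!(m.begin_transfer(), Err(EREVOKED)); // new transfers rejected
    m.complete_transfer();
    assert!(m.try_revoke());                       // drained: fully revoked
}
```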

P2P Memory Accounting

All P2P DMA memory usage is accounted to the originating cgroup:

/sys/fs/cgroup/<group>/accel.p2p.current
    # Current bytes transferred via P2P DMA (read-only)
    # Sum across all devices this cgroup can access

/sys/fs/cgroup/<group>/accel.p2p.max
    # Maximum bytes that can be transferred via P2P in the current period
    # Format: "bytes period_us"
    # Example: "10737418240 1000000" (10GB per second)
    # Default: "max" (unlimited)

/sys/fs/cgroup/<group>/accel.p2p.stat
    # P2P statistics (read-only):
    #   total_bytes <cumulative bytes transferred>
    #   mappings <current active P2P mappings>
    #   revocations <times P2P was revoked>
    #   timeout_revocations <revocations that timed out waiting for in-flight>

The byte counter is incremented on each p2p_dma_transfer completion (fence signaled). If a transfer fails or is aborted, the bytes are not counted.

Device Quarantine and P2P

When a device is quarantined (Section 42.3.2), all P2P mappings involving that device are immediately revoked using the protocol above. This prevents a quarantined device from:

  • Initiating new P2P DMA to other devices
  • Receiving P2P DMA from other devices
  • Corrupting or exfiltrating data via P2P paths

The quarantine process blocks until all P2P revocations complete. If any revocation times out (in-flight transactions do not complete), the device is force-reset via PCIe FLR or driver-specific reset, which guarantees termination of all DMA activity.

Security Audit Trail

All P2P authorization and revocation events are logged to the kernel audit subsystem:

Audit Events (logged to /sys/kernel/debug/isle/audit.log):
  P2P_MAP:    src_device=<id> dst_device=<id> size=<bytes> cap=<token> cgroup=<path>
  P2P_UNMAP:  mapping=<handle> reason=<explicit|cap_revoke|cgroup_exit|quarantine>
  P2P_REVOKE: mapping=<handle> src_device=<id> dst_device=<id> in_flight=<count> timed_out=<bool>
  P2P_ACL_ADD:    src_device=<id> dst_device=<id> cap=<token> max_bytes=<bytes>
  P2P_ACL_REMOVE: src_device=<id> dst_device=<id> reason=<cap_revoke|expired|quarantine>

These audit events are available for security monitoring and forensics.


44. Accelerator Isolation and Scheduling

44.1 Capability-Based Access Control

Every accelerator context is gated by the ISLE capability system:

// Extend existing cap_id constants in isle-driver-sdk/src/capability.rs

pub const ACCEL_COMPUTE: u32    = 0x0100;  // Submit compute work
pub const ACCEL_MEMORY: u32     = 0x0101;  // Allocate device memory
pub const ACCEL_P2P: u32        = 0x0102;  // P2P DMA transfers
pub const ACCEL_PREEMPT: u32    = 0x0103;  // Preempt other contexts (admin)
pub const ACCEL_PERF: u32       = 0x0104;  // Read performance counters
pub const ACCEL_POWER: u32      = 0x0105;  // Change power/clock state (admin)
pub const ACCEL_CONTEXT_RT: u32 = 0x0106;  // Create realtime-priority contexts

A container gets a capability token that says "50% compute time on GPU 0, 8GB VRAM limit." The kernel enforces this through the AccelScheduler and memory accounting.

44.2 Cgroup Integration

New cgroup controller: accel.

/sys/fs/cgroup/<group>/accel.devices
    # Which accelerators this cgroup can access (by device index)
    # Format: "0 1 3" (devices 0, 1, 3 allowed)

/sys/fs/cgroup/<group>/accel.memory.max
    # Maximum total device memory across all accelerators (bytes)
    # Format: "8589934592" (8GB)

/sys/fs/cgroup/<group>/accel.memory.current
    # Current device memory usage (read-only)

/sys/fs/cgroup/<group>/accel.compute.guarantee
    # Guaranteed compute bandwidth (microseconds per second, per device)
    # Format: "device_idx quota period"
    # Example: "0 500000 1000000" (50% of GPU 0 guaranteed)

/sys/fs/cgroup/<group>/accel.compute.max
    # Maximum compute bandwidth ceiling
    # Format: same as guarantee

/sys/fs/cgroup/<group>/accel.compute.weight
    # Relative share of excess compute time (like cpu.weight)
    # Default: 100

/sys/fs/cgroup/<group>/accel.priority
    # Default priority for contexts created by this cgroup
    # "background", "normal", "high"
    # ("realtime" requires ACCEL_CONTEXT_RT capability)

/sys/fs/cgroup/<group>/accel.stat
    # Usage statistics (read-only):
    #   compute_time_us <total compute microseconds>
    #   memory_current <current bytes>
    #   memory_peak <peak bytes>
    #   submissions <total command submissions>
    #   preemptions <times preempted by higher priority>
    #   faults <device page faults>
    #   migrations <pages migrated>

44.3 Memory Isolation

Each AccelContext has its own device-side page table (or partition of the device's address space). The kernel ensures:

  1. Context A cannot access Context B's device memory (separate page tables / address spaces).
  2. Context A cannot exceed its accel.memory.max cgroup limit.
  3. When Context A is destroyed, all its device memory is freed immediately.
  4. The OOM killer is aware of device memory. If a process is consuming excessive device memory and the device is under pressure, the OOM killer can target it.

Device Memory and the OOM Killer:

  • Device memory is additive to a process's OOM score — device_bytes is counted as an RSS equivalent.
  • Before the OOM killer selects a victim, the kernel attempts soft reclaim: evicting device pages back to CPU RAM (using the migrate_pages vtable call).
  • Per-device memory pressure is tracked with high/low watermarks; when the high watermark is breached, the AccelScheduler proactively triggers eviction of cold pages from device memory to CPU RAM.
  • If a process exceeds its cgroup accel.memory.max limit, the allocation returns ENOMEM rather than triggering an OOM kill — the process must handle the allocation failure gracefully.
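These rules can be sketched in a few lines. This is an illustrative model of the accounting described above (the function names and error constant are assumptions, not the kernel's actual API):

```rust
/// Device memory is additive to the process's OOM score.
pub fn oom_badness(rss_bytes: u64, device_bytes: u64) -> u64 {
    rss_bytes + device_bytes
}

#[derive(Debug, PartialEq)]
pub enum PressureAction { None, EvictColdPages }

/// Breaching the high watermark triggers proactive eviction of cold pages.
pub fn check_watermarks(used: u64, high: u64) -> PressureAction {
    if used >= high { PressureAction::EvictColdPages } else { PressureAction::None }
}

/// Per-cgroup limit check: exceeding accel.memory.max fails the
/// allocation with ENOMEM instead of invoking the OOM killer.
pub fn try_account(current: u64, request: u64, max: u64) -> Result<u64, i32> {
    const ENOMEM: i32 = 12;
    let new = current.checked_add(request).ok_or(ENOMEM)?;
    if new > max { Err(ENOMEM) } else { Ok(new) }
}
```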

44.4 Compute Time Isolation

The AccelScheduler enforces compute time limits per context:

  • accel.compute.max: Hard ceiling. If a context has used its quota for this period, its submissions are queued until the next period. (Same semantics as cpu.max.)
  • accel.compute.guarantee: Minimum bandwidth via CBS server. (Same algorithm as cpu.guarantee from Section 15.)
  • accel.compute.weight: Proportional sharing of compute time not covered by guarantees. (Same semantics as cpu.weight.)
  • Preemption: If a high-priority context's CBS server becomes runnable, the scheduler preempts the current context (if hardware supports it). Otherwise, the current submission runs to completion and the high-priority context gets the next slot.
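The accel.compute.max ceiling follows the same per-period quota accounting as cpu.max. A minimal sketch of that mechanism, with illustrative names (the real AccelScheduler integrates this with CBS servers and weights):

```rust
/// Per-context compute quota with cpu.max-style period semantics.
pub struct ComputeBudget {
    pub quota_us: u64,   // allowed compute time per period
    pub period_us: u64,  // period length
    used_us: u64,        // compute time consumed in the current period
    period_start_us: u64,
}

impl ComputeBudget {
    pub fn new(quota_us: u64, period_us: u64) -> Self {
        Self { quota_us, period_us, used_us: 0, period_start_us: 0 }
    }

    /// Returns true if a submission may run now; false means it is
    /// queued until the next period (quota exhausted).
    pub fn try_admit(&mut self, now_us: u64, cost_us: u64) -> bool {
        // Roll over to a new period once one has elapsed.
        if now_us.saturating_sub(self.period_start_us) >= self.period_us {
            self.period_start_us = now_us;
            self.used_us = 0;
        }
        if self.used_us + cost_us <= self.quota_us {
            self.used_us += cost_us;
            true
        } else {
            false
        }
    }
}
```

A context with quota "500000 1000000" can consume 500ms of compute per second; submissions beyond that are deferred to the next period rather than rejected.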

44.5 Device Partitioning

For hardware that supports partitioning (NVIDIA MIG, AMD spatial partitioning):

/// Device partition descriptor.
#[repr(C)]
pub struct AccelPartition {
    /// Partition index.
    pub index: u32,
    /// Compute units assigned to this partition (same semantics as
    /// `AccelDeviceInfo::compute_units` — see Section 42.2.3).
    pub compute_units: u32,
    /// Memory assigned to this partition (bytes).
    pub memory_bytes: u64,
    /// Unique partition ID (for cgroup binding).
    pub partition_id: u64,
}
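A partition layout must fit within the device's available resources. A hedged sketch of the validation the registry might perform when partitions are created (the struct and function here are illustrative, mirroring the compute_units/memory_bytes fields above):

```rust
/// Illustrative partition descriptor for validation purposes.
pub struct Partition {
    pub compute_units: u32,
    pub memory_bytes: u64,
}

/// Check that a proposed set of partitions fits within the device's
/// available compute units and memory.
pub fn validate_partitions(
    parts: &[Partition],
    avail_units: u32,
    avail_bytes: u64,
) -> bool {
    let units: u32 = parts.iter().map(|p| p.compute_units).sum();
    let bytes: u64 = parts.iter().map(|p| p.memory_bytes).sum();
    units <= avail_units && bytes <= avail_bytes
}
```

The A100 example below (28 + 28 + 42 = 98 SMs against 98 available) passes this check exactly.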

The device registry models partitions as child nodes of the GPU device node:

pci0000:00
  +-- 0000:41:00.0 (GPU, NVIDIA A100)
      +-- partition0 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition1 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition2 (MIG 3g.40gb: 42 SMs, 40GB)
      -- (108 total SMs, 10 reserved for system/L2 cache management; 98 available for MIG partitions)

Each partition is an independently schedulable and isolatable unit. Cgroups can be bound to specific partitions: echo "0000:41:00.0/partition0" > accel.devices.

44.6 GPU Virtualization Modes

GPUs support multiple virtualization approaches. The kernel accommodates all four without imposing one model.

Note: VFIO passthrough (mode 1) and SR-IOV (mode 2) are general-purpose PCIe virtualization technologies — widely used for NICs, NVMe controllers, FPGAs, and other devices. The general IOMMU/VFIO mechanism is described in Section 7.3.8. This section describes their specific application to GPUs and accelerators. Modes 3 (mdev) and 4 (MIG) are GPU/accelerator-specific.

1. PCIe Passthrough (VFIO):
   Entire GPU assigned to a single VM via IOMMU.
   Kernel role: IOMMU group management, VFIO device file (/dev/vfio/N).
   Performance: near-native. No sharing between VMs.
   Use case: dedicated GPU per VM (HPC, gaming).
   (This is the same VFIO mechanism used for NIC passthrough, NVMe
   passthrough, etc. — the interface is device-agnostic.)

2. SR-IOV (Single Root I/O Virtualization):
   Hardware creates Virtual Functions (VFs), each a separate PCIe
   device with its own BARs and IOMMU mapping.
   Kernel role: enumerate VFs, create device registry nodes for each,
   assign VFs to VMs via VFIO. AccelScheduler is NOT involved — each VF
   is an independent device scheduled by its own firmware.
   GPU support: Intel Data Center GPU Max, some AMD MI-series.
   Not supported by: NVIDIA consumer GPUs, most AMD consumer GPUs.
   (SR-IOV is more common for NICs — e.g., Intel E810, Mellanox
   ConnectX — where each VF provides an independent network interface.
   The kernel handles GPU VFs and NIC VFs identically at the PCIe level.)

3. Mediated Passthrough (mdev / vGPU):
   Software-defined GPU partitions, managed by the GPU driver.
   NVIDIA vGPU (GRID) and Intel GVT-g use this model.
   (AMD MxGPU uses SR-IOV, not mdev — see #2 above.)
   Kernel role:
     - mdev framework: /sys/bus/mdev/ device lifecycle (same as Linux).
     - Each mdev appears as a VFIO device to the VM (same API as #1).
     - The GPU driver (Tier 1) handles the actual time-slicing and
       memory partitioning internally.
     - AccelScheduler can enforce per-mdev cgroup limits if the driver
       exposes per-mdev utilization via get_utilization().
   Trade-off: more flexible than SR-IOV, but relies on driver quality.

4. MIG (Multi-Instance GPU) — NVIDIA A100/H100:
   Hardware partitioning into isolated GPU instances, each with
   dedicated SMs, memory controllers, and L2 cache.
   Modeled as child nodes in device registry (§44.5 above).
   Can be combined with VFIO: each MIG instance can be passed through
   to a separate VM.

The KABI AccelBase vtable supports all four modes — the kernel sees devices (physical, SR-IOV VF, MIG partition, or mdev) through the same interface. The virtualization mode is a configuration choice, not an architectural difference.


45. In-Kernel Inference Engine

45.1 Rationale

The kernel makes millions of decisions per second: which page to evict, which I/O to schedule next, which task to migrate, whether a health metric is anomalous. Today these decisions use hand-tuned heuristics. Machine learning can do better for pattern-dependent decisions.

45.2 Constraints

In-kernel inference is not general-purpose ML. It must satisfy:

  1. Deterministic execution time: Every inference call must complete within a bounded number of cycles. No data-dependent loops, no dynamic allocation. For multi-layer models (TinyNeuralNet), inference is preemptible at layer boundaries: the inference loop checks need_resched() between layers and yields if a higher-priority task or interrupt is pending. This ensures CPU inference (which may run for ~500ns-5μs) does not cause scheduling latency spikes on latency-sensitive systems.
  2. No floating point: Kernel code must not use FPU/SIMD registers (they belong to userspace). All computation uses integer/fixed-point arithmetic.
  3. Bounded memory: Model size is fixed at load time. No dynamic allocation during inference.
  4. Safe fallback: If the model produces nonsensical output, the system falls back to a traditional heuristic. The model is advisory, never authoritative for safety-critical decisions.
  5. Offline training: Models are trained in userspace (with full floating-point, GPU acceleration, unlimited time). Only the trained, quantized model is loaded into the kernel.

45.3 Supported Model Types

// isle-core/src/inference/mod.rs (kernel-internal)

pub enum KernelModelType {
    /// Decision tree / random forest.
    /// Bounded depth (max 32), bounded node count (max 64K).
    /// Inference = walk tree from root to leaf, O(depth) comparisons.
    /// Best for: classification decisions with <50 features.
    DecisionTree,

    /// Quantized lookup table.
    /// Input is quantized to N bits, output is a table lookup.
    /// O(1) inference. Best for: 1-2 dimensional functions.
    LookupTable,

    /// Quantized linear model.
    /// weights: [i16; N], bias: i32, threshold: i32.
    /// Inference = dot product + compare. O(N) multiplications.
    /// Best for: simple binary classification / regression.
    LinearModel,

    /// Quantized tiny neural network.
    /// Fixed architecture: input → hidden(s) → output.
    /// All weights INT8, activations INT8, accumulation INT32.
    /// Maximum: 4 layers, 256 neurons per layer.
    /// Inference time bounded by architecture constants.
    TinyNeuralNet,
}
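The LinearModel variant above ("dot product + compare", integer-only per constraint 2 in Section 45.2) can be sketched in a handful of lines. The accumulator is widened to i64 to avoid overflow; this is an illustrative implementation, not the kernel's exact code:

```rust
/// Quantized linear model inference: i16 weights, i32 inputs,
/// i64 accumulation, threshold compare. No floating point anywhere.
pub fn linear_infer(weights: &[i16], bias: i32, threshold: i32, input: &[i32]) -> bool {
    debug_assert_eq!(weights.len(), input.len());
    let mut acc: i64 = bias as i64;
    for (w, x) in weights.iter().zip(input.iter()) {
        acc += (*w as i64) * (*x as i64);
    }
    // Binary classification: fire if the score exceeds the threshold.
    acc > threshold as i64
}
```

Inference cost is exactly N multiply-adds plus one compare, so max_latency_ns can be computed at load time from N alone.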

45.4 Model Loading and Lifecycle

Models are trained in userspace and loaded into the kernel via a sysfs interface:

/sys/kernel/isle/inference/models/
    page_prefetch/
        model.bin       # Write: load model binary. Read: model metadata.
        active          # "1" to enable, "0" to disable, "heuristic" to force heuristic fallback
        accuracy        # Read: online accuracy estimate (correct predictions / total)
        latency_ns      # Read: average inference latency
        invocations     # Read: total inference calls
        fallbacks       # Read: times fell back to heuristic
    io_scheduler/
        model.bin
        active
        accuracy
        latency_ns
        ...
    fma_anomaly/
        model.bin
        active
        ...

// isle-core/src/inference/model.rs (kernel-internal)

pub struct KernelModel {
    /// Model name (e.g., "page_prefetch"), used in sysfs paths and log messages.
    pub name: &'static str,

    /// Model type determines the inference algorithm.
    pub model_type: KernelModelType,

    /// Model parameters (weights, tree nodes, lookup table).
    /// Allocated as a single contiguous block, fixed at load time.
    pub params: &'static [u8],

    /// Input feature count.
    pub input_features: u32,

    /// Output count (1 for regression, N for N-class classification).
    pub outputs: u32,

    /// Maximum inference latency in nanoseconds (pre-computed from model size).
    pub max_latency_ns: u64,

    /// Whether this model is currently active.
    pub active: AtomicBool,

    /// Whether hardware offload is enabled for this model (Section 45.6).
    pub hw_offload_enabled: AtomicBool,

    /// Consecutive hardware-offload timeouts (Section 45.6).
    pub hw_timeout_count: AtomicU64,

    /// Online accuracy tracking.
    pub stats: ModelStats,
}

pub struct ModelStats {
    pub total_invocations: AtomicU64,
    pub correct_predictions: AtomicU64,  // When ground truth is available
    pub total_latency_ns: AtomicU64,
    pub fallback_count: AtomicU64,
}

45.5 Use Cases

Use case 1: Learned Page Prefetching

Replace Linux's simple sequential readahead with a learned prefetcher:

Input features (per page fault):
  - Faulting virtual address (quantized to page range)
  - Previous N fault addresses (pattern detector)
  - Process ID (different processes have different patterns)
  - Time since last fault
  - Memory region type (heap, stack, mmap, file-backed)

Output:
  - Next K pages to prefetch (page offsets relative to current fault)

Model type: TinyNeuralNet (2 hidden layers, 128 neurons each, INT8)
Inference time: ~500ns
Benefit: 20-40% reduction in page fault rate for workloads with learnable patterns
Fallback: Standard sequential readahead (Linux default)

Ground truth for online accuracy: track whether prefetched pages are actually accessed within a time window. If accuracy drops below threshold, fall back to heuristic.
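The online accuracy check above can be sketched as a simple hit counter. This is an illustrative model of the mechanism (the struct name and the permille threshold convention are assumptions):

```rust
/// Track whether prefetched pages were actually accessed within the
/// observation window; fall back to sequential readahead if accuracy
/// drops below a threshold.
pub struct PrefetchAccuracy {
    issued: u64,
    hit: u64,
}

impl PrefetchAccuracy {
    pub fn new() -> Self { Self { issued: 0, hit: 0 } }

    /// Record the outcome of one prefetched page.
    pub fn record(&mut self, was_accessed_in_window: bool) {
        self.issued += 1;
        if was_accessed_in_window { self.hit += 1; }
    }

    /// True if the model should stay active (accuracy >= threshold,
    /// expressed in permille to stay in integer arithmetic).
    pub fn above_threshold(&self, threshold_permille: u64) -> bool {
        if self.issued == 0 { return true; } // no evidence yet
        self.hit * 1000 / self.issued >= threshold_permille
    }
}
```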

Use case 2: I/O Scheduling

Optimize I/O queue ordering based on learned workload patterns:

Input features (per I/O request):
  - LBA (sector address, quantized)
  - Request size
  - Read vs write
  - Queue depth
  - Device utilization
  - Recent I/O pattern (last N requests, summarized)

Output:
  - Priority score (determines queue ordering)

Model type: DecisionTree (depth 16, ~4K nodes)
Inference time: ~200ns
Benefit: 5-15% IOPS improvement for mixed workloads
Fallback: mq-deadline heuristic
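DecisionTree inference (used by this model type) is a bounded walk over a flat node array: a feature compare at each internal node, a value at each leaf. A sketch under the Section 45.3 constraints, with illustrative types:

```rust
/// Flat-array decision tree node. Internal nodes compare one input
/// feature against a threshold; leaves carry the output value.
#[derive(Clone, Copy)]
pub enum Node {
    /// If input[feature] <= threshold, go to `left`, else `right`.
    Split { feature: usize, threshold: i32, left: u16, right: u16 },
    Leaf(i32),
}

/// Walk from the root; the depth cap bounds worst-case comparisons.
/// Returns None on a malformed tree (caller falls back to heuristic).
pub fn tree_infer(nodes: &[Node], input: &[i32], max_depth: u32) -> Option<i32> {
    let mut idx = 0usize;
    for _ in 0..=max_depth {
        match nodes.get(idx)? {
            Node::Leaf(v) => return Some(*v),
            Node::Split { feature, threshold, left, right } => {
                idx = if input[*feature] <= *threshold {
                    *left as usize
                } else {
                    *right as usize
                };
            }
        }
    }
    None // depth bound exceeded
}
```

A depth-16 tree costs at most 16 compares per inference, which is how the ~200ns figure above is bounded at load time.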

Use case 3: FMA Anomaly Detection

Detect correlated hardware degradation that threshold rules miss:

Input features (per health telemetry window):
  - ECC error rate (current window)
  - ECC error rate (historical baseline)
  - Temperature delta from baseline
  - PCIe correctable error rate
  - SMART attribute trends (multiple attributes)
  - Device age / power-on hours

Output:
  - Anomaly score (0.0 - 1.0 in fixed-point)
  - Predicted failure class (memory, storage, bus, thermal)

Model type: DecisionTree (depth 20, ~8K nodes)
Inference time: ~300ns
Benefit: Detect multi-signal degradation patterns that simple thresholds miss
Fallback: Threshold rules (Section 39.5)

Use case 4: Accelerator Memory Migration Policy

Decide which pages to migrate between CPU and GPU memory:

Input features (per page):
  - Access count on device (recent window)
  - Access count on CPU (recent window)
  - Time since last access
  - Page size
  - Current device memory pressure

Output:
  - Migrate to device / keep on CPU / evict from device

Model type: LinearModel (fast, simple)
Inference time: ~50ns per page
Benefit: Better migration decisions than fixed LRU
Fallback: LRU eviction
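A hedged sketch of what the LinearModel-driven decision might look like: score the page from its features, then bucket the score into the three outcomes. The weights and thresholds here are placeholders, not trained values:

```rust
#[derive(Debug, PartialEq)]
pub enum MigrationDecision { MigrateToDevice, KeepOnCpu, EvictFromDevice }

/// Integer linear score: positive favors device residency,
/// device memory pressure pushes toward eviction.
pub fn migration_decide(
    device_accesses: i32,
    cpu_accesses: i32,
    device_pressure_pct: i32, // 0..=100
) -> MigrationDecision {
    let score = 4 * device_accesses - 2 * cpu_accesses - device_pressure_pct;
    if score > 50 {
        MigrationDecision::MigrateToDevice
    } else if score < -50 {
        MigrationDecision::EvictFromDevice
    } else {
        MigrationDecision::KeepOnCpu
    }
}
```

The hysteresis band between the two thresholds avoids ping-ponging pages between CPU and device memory, which fixed LRU cannot express.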

45.6 Safety Guarantees

The primary safety mechanism is mandatory structural validation at model load time (Section 45.7). The load-time validator statically proves that every accepted model terminates in bounded time before it is ever invoked:

  • Decision trees: Verify tree depth <= max_depth and that the tree is acyclic (DAG check). A tree of bounded depth with no cycles has a statically known maximum number of comparisons per inference.
  • Linear models: Verify the number of input features and outputs matches the declared dimensions. Inference is a single matrix-vector multiply with bounded operations.
  • Neural networks: Verify the layer count is fixed (no recurrent connections that could loop), and that the total multiply-accumulate (MAC) operations <= max_ops. A feedforward network with bounded layers and bounded dimensions has a statically known operation count.

Any model that cannot be statically proven to terminate in bounded time is rejected at load time and never reaches the inference path. This is the preemptive guarantee.
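A sketch of the decision-tree branch of the load-time validator. It uses a forward-edge rule (children must have strictly larger indices than their parent) as a simple sufficient condition for acyclicity in a flat node array; the types and limits are illustrative:

```rust
/// Minimal tree node for structural validation.
pub enum TNode {
    Split { left: u16, right: u16 },
    Leaf,
}

/// Accept only trees that provably terminate: bounded node count,
/// forward edges only (no cycles), and depth <= max_depth.
pub fn validate_tree(nodes: &[TNode], max_depth: u32) -> bool {
    if nodes.is_empty() || nodes.len() > 65536 {
        return false;
    }
    // Forward-edge check: child index > parent index rules out cycles
    // without a full graph traversal.
    for (i, n) in nodes.iter().enumerate() {
        if let TNode::Split { left, right } = n {
            let (l, r) = (*left as usize, *right as usize);
            if l <= i || r <= i || l >= nodes.len() || r >= nodes.len() {
                return false;
            }
        }
    }
    // Depth check by worst-case walk; recursion depth is bounded by the
    // node count thanks to the forward-edge property verified above.
    fn depth(nodes: &[TNode], i: usize) -> u32 {
        match &nodes[i] {
            TNode::Leaf => 0,
            TNode::Split { left, right } =>
                1 + depth(nodes, *left as usize).max(depth(nodes, *right as usize)),
        }
    }
    depth(nodes, 0) <= max_depth
}
```

Any tree rejected here never reaches the inference path; an accepted tree has a statically known maximum number of comparisons.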

The post-hoc cycle check in infer_safe below remains as a defense-in-depth assertion — if the load-time validator has a bug and admits a model that runs longer than expected, the cycle check catches it and disables the model. But the cycle check is not the primary safety mechanism, because it measures cycles after run_inference completes; it cannot interrupt an infinite loop.

// isle-core/src/inference/safety.rs

/// Every model invocation goes through this wrapper.
/// The primary termination guarantee comes from load-time structural
/// validation (Section 45.7). This wrapper provides defense-in-depth:
/// post-hoc cycle checking, output validation, and fallback.
pub fn infer_safe<const MAX_NS: u64>(
    model: &KernelModel,
    input: &[i32],
    output: &mut [i32],
    fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    // 1. Check model is active
    if !model.active.load(Ordering::Acquire) {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 2. Run inference with post-hoc cycle measurement.
    //    Termination is guaranteed by load-time structural validation
    //    (bounded tree depth / bounded layer count / bounded MAC ops).
    //    The cycle check is defense-in-depth only.
    //
    //    For TinyNeuralNet models, run_inference() checks need_resched()
    //    between layers. If a higher-priority task is pending, it yields
    //    and resumes after rescheduling. This adds ~5-20ns per layer
    //    boundary but prevents CPU inference from causing scheduling
    //    latency spikes. DecisionTree and LinearModel complete in <200ns
    //    and do not need preemption checks.
    //
    //    Because run_inference() may yield to the scheduler (via
    //    need_resched()), wall-clock TSC cycles include time spent
    //    sleeping/waiting — not actual model execution time. On a busy
    //    system, scheduler latency during a yield could push the
    //    wall-clock measurement far beyond MAX_NS even though the model
    //    executed correctly within its budget. To avoid false positives,
    //    we track whether a yield occurred during inference and skip the
    //    post-hoc cycle check in that case. The load-time structural
    //    validator (Section 45.7) is the primary termination guarantee;
    //    the cycle check is defense-in-depth that is only meaningful
    //    when the model ran without interruption.
    let mut yielded = false;
    let start = arch::current::cpu::read_cycle_counter();
    let result = model.run_inference(input, output, &mut yielded);
    let elapsed = arch::current::cpu::read_cycle_counter() - start;

    // 3. Defense-in-depth: check execution time was within bounds.
    //    This should never fire if the load-time validator is correct.
    //    If it does fire, it means the validator has a bug — disable
    //    the model and fall back to the heuristic.
    //    Skip the check if a yield occurred during inference, because
    //    the elapsed TSC cycles include scheduler/sleep time that is
    //    not attributable to the model. The load-time validator remains
    //    the primary safety mechanism regardless.
    if !yielded && elapsed > tsc_cycles_from_ns(MAX_NS) {
        model.active.store(false, Ordering::Release);
        fallback(input, output);
        log_warning!("inference model {} exceeded time bound (validator bug?), disabled", model.name);
        return;
    }

    // 4. Sanity-check output (model-specific validation)
    if !model.validate_output(output) {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 5. Update statistics
    model.stats.total_invocations.fetch_add(1, Ordering::Relaxed);
    model.stats.total_latency_ns.fetch_add(
        tsc_ns_from_cycles(elapsed), Ordering::Relaxed
    );
}

Execution time bounding for arbitrary GPU kernels: Unlike the in-kernel models above (whose structure is validated at load time), arbitrary GPU kernels cannot be statically proven to terminate — this is equivalent to the halting problem. Instead, ISLE uses a hardware watchdog approach:

  • Every accelerator dispatch includes a max_execution_us: u64 timeout (default: 5,000,000 μs = 5 seconds for compute kernels, 16,667 μs = 1/60th second for graphics).
  • The driver programs the GPU's hardware watchdog timer (GFX_TIMEOUT on AMD, TDR on NVIDIA via vGPU, or a software timer for accelerators without a hardware watchdog) before submitting the kernel.
  • If the kernel exceeds max_execution_us, the GPU engine is reset (per-engine reset, not full GPU reset, where hardware supports it) and the dispatch returns ETIMEDOUT to the submitter. Other engines and other contexts are not affected.
  • Where a best-effort execution time estimate is possible (e.g., bounded loop counts × estimated per-iteration cost), it is used to set a tighter watchdog timeout. Kernels that cannot be analyzed use the default timeout.

Hardware-accelerated inference timeout (NPU/DSP offload):

The infer_safe wrapper above covers CPU-side inference only. When a TinyNeuralNet model is offloaded to a hardware accelerator (e.g., an NPU or DSP via the AccelBase KABI), the post-hoc cycle counter cannot bound execution: once a command buffer is submitted to the device, the CPU cannot observe or interrupt mid-execution (for non-preemptible devices, see Section 42.2.4).

For hardware-offloaded inference, the scheduler enforces a hardware command timeout using the max_execution_us field in AccelContextLimits (Section 42.2.3). The inference subsystem creates a dedicated AccelContext for each model with a tight timeout derived from KernelModel::max_latency_ns:

// isle-core/src/inference/hw_offload.rs (kernel-internal)

/// Submit an inference request to a hardware accelerator with a hard timeout.
/// If the device does not complete within `model.max_latency_ns * HW_TIMEOUT_MULTIPLIER`,
/// the AccelScheduler cancels the submission and `infer_hw` falls back to the CPU path.
///
/// HW_TIMEOUT_MULTIPLIER = 4 — allows for device load variance while still bounding
/// worst-case lock-up. A misbehaving model cannot hold the accelerator indefinitely.
pub fn infer_hw(
    model: &KernelModel,
    ctx: &AccelContextHandle,
    input: &[i32],
    output: &mut [i32],
    fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    // 1. Submit inference command buffer to hardware.
    //    AccelContextLimits::max_execution_us is set to
    //    (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000 at context creation.
    //    The AccelScheduler arms a kernel timer on submission; if the device does not
    //    complete before the timer fires, it calls preempt_context() or, for
    //    non-preemptible devices, marks the submission Timeout at the next boundary.
    let submit_result = accel_submit_inference(ctx, model, input);
    if submit_result.is_err() {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }

    // 2. Wait for completion with the same timeout bound.
    //    poll_completion() returns Timeout if max_execution_us elapsed.
    match accel_poll_completion(ctx, model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) {
        AccelCompletionStatus::Success => { /* copy results from device buffer to output */ }
        AccelCompletionStatus::Timeout | AccelCompletionStatus::Error => {
            // Device did not complete in time — fall back to CPU heuristic.
            // Disable hardware offload for this model after repeated timeouts.
            model.hw_timeout_count.fetch_add(1, Ordering::Relaxed);
            if model.hw_timeout_count.load(Ordering::Relaxed) > HW_TIMEOUT_DISABLE_THRESHOLD {
                model.hw_offload_enabled.store(false, Ordering::Release);
                log_warning!("inference model {} exceeded hw timeout {} times, disabling hw offload",
                    model.name, HW_TIMEOUT_DISABLE_THRESHOLD);
            }
            fallback(input, output);
            model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        }
        AccelCompletionStatus::Preempted => {
            // Higher-priority work displaced the inference — fall back to CPU.
            fallback(input, output);
            model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        }
    }
}

The AccelContext for in-kernel inference is created at model load time with:

  • max_execution_us = (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000
  • priority = AccelPriority::Background (inference is advisory; it never starves user workloads)
  • max_memory_bytes = model parameter size + fixed scratch buffer (no dynamic allocation)

This ensures that a misbehaving model or faulty NPU firmware cannot lock the accelerator indefinitely. The kernel timer in the AccelScheduler provides the hard bound; the CPU fallback path ensures the kernel continues operating correctly even if the hardware accelerator is unresponsive.

45.6a Adversarial Robustness

Section 45.6 addresses runtime safety (cycle budgets, fallback, output clamping). This section addresses a different threat: adversarial inputs — workload patterns deliberately crafted to exploit learned kernel models.

Threat model — an unprivileged attacker crafts memory access patterns, I/O sequences, or scheduling behavior designed to:

  1. Degrade performance: trick the page prefetcher into evicting hot pages, or trick the I/O scheduler into making sub-optimal ordering decisions.
  2. Denial of service: cause the model to consistently produce worst-case outputs, degrading system throughput for co-tenants on the same machine.
  3. Information leakage: infer information about other processes' behavior by observing how the model's decisions change in response to probing inputs.

Why kernel models are partially resistant — unlike ML models in adversarial ML research (image classifiers, NLP), kernel models operate on aggregate statistics (page fault rate over the last 1000 faults, I/O queue depth histogram), not on raw inputs from a single source. An attacker controls only their own process's behavior, which is one signal among many feeding into the model.

Mitigations:

  1. Per-process model state isolation: page prefetch and I/O scheduling models maintain per-process (per-cgroup) input features. Attacker process A's access pattern cannot directly influence model decisions for victim process B. The model sees A's features and B's features independently.

  2. Output clamping (already in Section 45.6): model output is clamped to a safe range. Even the worst possible model output (prefetch the maximally wrong pages, schedule I/O in the worst possible order) produces bounded degradation — the system falls back to LRU/FIFO behavior, which is the same as having no model at all. The adversary's best attack reduces performance to the no-model baseline, not below it.

  3. Anomaly detection on model inputs: the infer_safe wrapper tracks input feature distributions. If input features drift significantly from the training distribution (measured by simple range checks and mean/variance tracking), the model is automatically disabled for that process/cgroup and the heuristic fallback activates. This prevents an attacker from driving the model into an untrained region of the input space.

  4. Model-decision audit tracing: all model decisions are available via stable tracepoints (Section 40). Security-sensitive deployments can monitor for anomalous model behavior (e.g., prefetch hit rate dropping below a threshold for a specific cgroup) and trigger investigation.

  5. Side-channel hardening: the model's decision is not directly observable by unprivileged processes. An attacker cannot call "what did the prefetcher decide?" — they can only observe timing (whether a page fault occurred or not). This is equivalent to the existing cache side-channel problem, not a new attack surface. Standard cache partitioning mitigations (CAT, MBA) apply equally to model-influenced cache behavior.
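Mitigation 3 (input distribution anomaly detection) can be sketched as a per-feature range guard. The margin rule, counter hysteresis, and names below are assumptions, not the kernel's actual implementation:

```rust
/// Per-feature guard: training range plus a small margin. Repeated
/// out-of-distribution observations disable the model for this cgroup.
pub struct FeatureGuard {
    pub min: i32,
    pub max: i32,
    out_of_range: u32,
    threshold: u32,
}

impl FeatureGuard {
    pub fn new(min: i32, max: i32, threshold: u32) -> Self {
        Self { min, max, out_of_range: 0, threshold }
    }

    /// Returns false once the model should be disabled for this cgroup
    /// (too many out-of-distribution observations in a row).
    pub fn observe(&mut self, value: i32) -> bool {
        // Allow a 10% margin beyond the training range.
        let margin = ((self.max as i64 - self.min as i64) / 10) as i32;
        if value < self.min - margin || value > self.max + margin {
            self.out_of_range += 1;
        } else {
            // In-range observations decay the counter (hysteresis).
            self.out_of_range = self.out_of_range.saturating_sub(1);
        }
        self.out_of_range < self.threshold
    }
}
```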

Accepted risk — a sophisticated attacker with co-tenant access can degrade their own performance or (with difficulty) slightly degrade co-tenant prefetch accuracy, but cannot cause worse-than-baseline behavior (output clamping), cannot crash the kernel (cycle watchdog), and cannot corrupt other processes' data (model outputs are advisory — they influence which pages to prefetch, not which pages are accessible). This risk profile is comparable to existing cache pollution attacks, which are accepted in shared environments.

45.6b Fallback Mode Safety Specification

When in-kernel inference encounters errors, timeouts, or hardware failures, the system must transition to a safe fallback mode without compromising system integrity. This section specifies the safety guarantees that MUST be maintained even when fallback is active.

45.6b.1 Fallback Trigger Conditions

Fallback mode is triggered by any of the following conditions:

Condition | Scope | Automatic Recovery
--- | --- | ---
Model disabled (active=false) | Per-model | Yes, when admin re-enables
Cycle budget exceeded | Per-model | No, requires admin review
Hardware offload timeout (NPU/DSP) | Per-model + per-device | Yes, after cooldown period
Output validation failure | Per-model, per-invocation | Yes, model auto-disables after threshold
Input distribution anomaly | Per-cgroup, per-model | Yes, after distribution normalizes
Accelerator device failure | Per-device | No, requires device reset
Model binary signature failure | Per-model | No, requires valid signed model

Critical invariant: Fallback mode NEVER disables safety-critical kernel checks. The inference engine provides advisory decisions only (page prefetch hints, I/O ordering hints). The following safety guarantees remain active unconditionally:

  • Memory access validation (page tables, permission checks)
  • Capability checks for all resource accesses
  • I/O command validation (DMA bounds, register ranges)
  • Interrupt masking and priority enforcement
  • Scheduler fairness and preemption guarantees
  • Watchdog timers for all hardware operations

45.6b.2 Fallback Scope and Isolation

Fallback is NEVER system-wide. The isolation granularity is:

  1. Per-model: Each inference model (page_prefetch, io_scheduler, etc.) operates independently. A failure in one model does not affect others.

  2. Per-cgroup for input anomaly: If a specific cgroup's input features drift outside the training distribution, fallback activates for that cgroup only. Other cgroups continue using the model normally.

  3. Per-device for hardware offload: If NPU A times out, models offloaded to NPU A fall back to CPU inference. NPU B continues operating normally.

// isle-core/src/inference/fallback.rs (kernel-internal)

/// Fallback state for a single model instance.
/// This is tracked independently per (model, cgroup) pair.
pub struct FallbackState {
    /// Why fallback is active (None = not in fallback)
    pub reason: Option<FallbackReason>,

    /// Timestamp when fallback mode was entered (monotonic clock)
    pub entered_at: Option<u64>,

    /// Number of consecutive fallback invocations
    pub consecutive_count: u64,

    /// Maximum consecutive fallbacks before escalation
    pub escalation_threshold: u64,
}

/// Reasons for entering fallback mode
pub enum FallbackReason {
    /// Model explicitly disabled by admin
    ModelDisabled,

    /// Cycle budget exceeded (validator bug suspected)
    CycleBudgetExceeded,

    /// Hardware accelerator timeout
    HwTimeout { device_id: u32, timeout_us: u64 },

    /// Output failed model-specific validation
    OutputValidationFailed,

    /// Input features outside training distribution
    InputDistributionAnomaly {
        feature_index: u32,
        observed_value: i32,
        expected_range: (i32, i32),
    },

    /// Accelerator device reported error
    DeviceError { device_id: u32, error_code: u32 },
}

45.6b.3 Fallback Duration and Escalation

Fallback mode has bounded duration with automatic escalation:

Duration | Action
--- | ---
0-60 seconds | Automatic recovery: continue using heuristic fallback, periodically retry model
60-300 seconds | Escalation: emit kernel warning event, log to audit trail
300+ seconds | Critical escalation: emit high-priority alert to system management daemon
3600+ seconds (configurable) | Model auto-disables: set active=false, require admin to re-enable

/// Fallback duration thresholds (configurable via sysfs)
pub const FALLBACK_RETRY_INTERVAL_SECS: u64 = 10;    // Retry model every 10s
pub const FALLBACK_WARNING_SECS: u64 = 60;           // Log warning after 60s
pub const FALLBACK_ESCALATION_SECS: u64 = 300;       // Alert after 5 minutes
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600;    // Disable model after 1 hour

Recovery attempts: While in fallback mode, the kernel periodically attempts to use the model again (every FALLBACK_RETRY_INTERVAL_SECS). If the model succeeds three consecutive times, fallback mode exits and normal operation resumes. This handles transient hardware glitches without operator intervention.
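The recovery rule above can be sketched as a small state machine (the struct is illustrative; the real implementation lives alongside FallbackState):

```rust
/// Track retry outcomes while in fallback: three consecutive
/// successes exit fallback mode; any failure resets the streak.
pub struct RecoveryTracker {
    in_fallback: bool,
    consecutive_ok: u32,
}

impl RecoveryTracker {
    pub fn new() -> Self {
        Self { in_fallback: true, consecutive_ok: 0 }
    }

    /// Record one retry outcome; returns true once fallback is exited.
    pub fn record_retry(&mut self, success: bool) -> bool {
        if success {
            self.consecutive_ok += 1;
            if self.consecutive_ok >= 3 {
                self.in_fallback = false;
            }
        } else {
            self.consecutive_ok = 0; // any failure resets the streak
        }
        !self.in_fallback
    }
}
```

Requiring a streak rather than a single success prevents a flapping device from bouncing the model in and out of fallback on every retry interval.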

Escalation path:

Fallback entered
    ↓
[Retry every 10s, up to 60s]
    ↓ (still failing)
Log warning: "Model X in fallback for 60s, reason=Y"
    ↓
[Continue retries, up to 300s]
    ↓ (still failing)
Emit event: isle_inference_fallback(model="X", duration_s=300, reason="Y")
    ↓
[Continue retries, up to 3600s]
    ↓ (still failing)
Auto-disable model, emit critical event
    ↓
Require admin to re-enable via sysfs write

45.6b.4 Admin Intervention Requirements

Certain fallback conditions require explicit admin action to exit fallback mode:

Requires admin intervention:

  • Model disabled due to cycle budget exceeded (indicates validator bug)
  • Accelerator device failure (requires device reset or replacement)
  • Model binary signature verification failure

Automatic recovery allowed:

  • Hardware timeout (transient NPU load)
  • Input distribution anomaly (workload shifted temporarily)
  • Output validation failure (if under threshold)

The sysfs interface exposes recovery control:

/sys/kernel/isle/inference/models/page_prefetch/
    fallback_reason        # Read: current fallback reason, empty if not in fallback
    fallback_duration_ms   # Read: milliseconds since fallback entered
    fallback_auto_recover  # Write: "1" to attempt immediate recovery, "0" to stay in fallback
    require_admin_reset    # Read: "1" if admin action required to exit fallback

To exit a fallback state requiring admin intervention:

# Check current state
cat /sys/kernel/isle/inference/models/page_prefetch/fallback_reason
# Output: "cycle_budget_exceeded"

# This condition requires admin review
cat /sys/kernel/isle/inference/models/page_prefetch/require_admin_reset
# Output: "1"

# After reviewing logs and validating the model is correct, admin resets:
echo 1 > /sys/kernel/isle/inference/models/page_prefetch/active

45.6b.5 Audit Logging

All fallback state transitions are logged to the kernel audit subsystem:

/// Audit event emitted on fallback state changes
pub struct FallbackAuditEvent {
    /// Timestamp (monotonic nanoseconds since boot)
    pub timestamp_ns: u64,

    /// Model identifier (e.g., "page_prefetch")
    pub model_name: &'static str,

    /// Cgroup ID (0 if fallback is model-wide)
    pub cgroup_id: u64,

    /// Device ID (0 if not device-related)
    pub device_id: u32,

    /// Event type
    pub event_type: FallbackEventType,

    /// Reason for the event
    pub reason: FallbackReason,
}

pub enum FallbackEventType {
    /// Entered fallback mode
    Entered,

    /// Automatic recovery attempt (retrying model)
    RecoveryAttempt,

    /// Successfully recovered, exiting fallback
    Recovered,

    /// Escalation: warning logged
    EscalationWarning,

    /// Escalation: critical alert emitted
    EscalationCritical,

    /// Model auto-disabled due to prolonged fallback
    AutoDisabled,

    /// Admin re-enabled model after manual review
    AdminReset,
}

Audit log entries (visible via tracefs and forwarded to system audit daemon):

[12345.678901] ISLE-INFERENCE: model=page_prefetch event=entered reason=hw_timeout device=4 timeout_us=2000
[12355.678901] ISLE-INFERENCE: model=page_prefetch event=recovery_attempt
[12355.679001] ISLE-INFERENCE: model=page_prefetch event=recovered
[12405.678901] ISLE-INFERENCE: model=io_scheduler event=entered reason=input_anomaly cgroup=1234 feature=2
[12465.678901] ISLE-INFERENCE: model=io_scheduler event=escalation_warning duration_s=60

Audit retention: Fallback audit events are retained for a minimum of 30 days (configurable) to support post-incident analysis. Security-sensitive deployments MAY configure longer retention or external log forwarding.

45.6b.6 Fallback Heuristic Specification

When fallback is active, the system MUST use a well-defined heuristic that provides correct (if suboptimal) behavior:

| Model | Fallback Heuristic |
|-------|--------------------|
| Page prefetch | Sequential readahead: next 4 pages from current fault address |
| I/O scheduling | mq-deadline: order by LBA, 50ms deadline for reads, 500ms for writes |
| FMA anomaly | Static thresholds: trigger alert if any single metric exceeds hard limit |
| Memory migration | Access-count based: migrate if device_access_count > cpu_access_count × 2 |
| Power budget | Equal distribution: allocate total_power / num_domains to each domain |

Critical safety property: Fallback heuristics MUST NOT bypass any safety checks. For example, the sequential readahead fallback still validates that:

  - The target pages fall within the process's virtual address space
  - The underlying physical pages are accessible (not protected, not corrupt)
  - The readahead does not exceed the process's memory limits

The fallback heuristic is simply a decision algorithm for which pages to prefetch or which I/O to prioritize — it does NOT bypass the memory management or I/O subsystem's safety validation layers.
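Two of the heuristics in the table are simple enough to sketch directly. The function names and the 4 KiB page-size constant are illustrative assumptions, and the safety validation described above is deliberately out of scope here:

```rust
const PAGE_SIZE: u64 = 4096; // illustrative; real page size is architecture-dependent

/// Sequential readahead fallback: the next 4 pages after the fault address.
/// (The returned addresses must still pass the VMA and memory-limit checks.)
pub fn readahead_pages(fault_addr: u64) -> [u64; 4] {
    let base = (fault_addr / PAGE_SIZE) * PAGE_SIZE; // page-align the fault
    [
        base + PAGE_SIZE,
        base + 2 * PAGE_SIZE,
        base + 3 * PAGE_SIZE,
        base + 4 * PAGE_SIZE,
    ]
}

/// Access-count migration fallback: migrate a page if the device touches
/// it more than twice as often as the CPU.
pub fn should_migrate(device_access_count: u64, cpu_access_count: u64) -> bool {
    device_access_count > cpu_access_count * 2
}

fn main() {
    // A fault at 0x1234 lies in the page at 0x1000; readahead starts at 0x2000.
    assert_eq!(readahead_pages(0x1234), [0x2000, 0x3000, 0x4000, 0x5000]);
    assert!(should_migrate(10, 4));  // 10 > 8: migrate
    assert!(!should_migrate(8, 4));  // 8 > 8 is false: stay
}
```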

45.7 Model Binary Format

In-kernel models use a simple binary format for loading:

Header (4790 bytes — larger than original 256 bytes due to ML-DSA signature):
  magic:          u32    = 0x49534C45 ("ISLE")
  version:        u32    = 1
  model_type:     u32    (KernelModelType discriminant)
  input_features: u32
  outputs:        u32
  param_size:     u64    (bytes of parameter data following header)
  max_latency_ns: u64    (pre-computed worst-case inference time)
  sha256_hash:    [u8; 32]  (SHA-256 over entire model file: header fields above
                              [magic through max_latency_ns] concatenated with parameter
                              data — integrity check covering both structure and weights)
  ed25519_sig:    [u8; 64]  (Ed25519 signature over entire model — authentication)
  mldsa_sig_len:  u16       (actual ML-DSA signature length; ML-DSA-65 = 3309, ML-DSA-87 = 4627)
  mldsa_sig:      [u8; 4627] (ML-DSA signature over entire model — post-quantum authentication;
                              sized for largest variant ML-DSA-87; actual length in mldsa_sig_len)
  _reserved:      [u8; 29]  (must be zero)

Parameter data: [u8; param_size]

Model binaries are verified using the same hybrid signature scheme as kernel modules (Section 22). The SHA-256 hash covers the entire model file (header + parameters), providing integrity over both the model structure and weights — an attacker cannot modify the header to reinterpret parameter data without invalidating the hash. The Ed25519 + ML-DSA signatures over the entire model provide authentication. Both signatures must verify against the kernel's built-in model signing public keys. Unsigned or incorrectly signed models are rejected unless the system is booted with accel.allow_unsigned=1 (disabled by default, requires CAP_ACCEL_ADMIN). This prevents an attacker from loading arbitrary model binaries even if they can construct one with a matching SHA-256 hash.

Load-time validation:

  1. Verify magic bytes (0x49534C45).
  2. Check param_size against maximum allowed model size (configurable, default 1 MB).
  3. Verify SHA-256 hash of entire model file (header fields magic through max_latency_ns concatenated with parameter data) matches sha256_hash.
  4. Signature verification — verify both Ed25519 and ML-DSA signatures over the entire model file (header + parameters) against the kernel's built-in model signing public keys. The ML-DSA signature uses the length indicated by mldsa_sig_len. Reject if either signature fails (unless accel.allow_unsigned=1).
  5. Structural termination proof — the validator statically proves bounded execution based on model type:
     - Decision trees: Verify tree depth <= max_depth and that the tree is acyclic (DAG check via topological sort). Reject if any cycle is found or depth exceeds the configured maximum.
     - Linear models: Verify input/output dimensions match the declared input_features and outputs fields. A single matrix-vector multiply is inherently bounded.
     - Neural networks: Verify the layer count is fixed (no recurrent connections), and that total multiply-accumulate operations <= max_ops (computed from layer dimensions). Reject any model with recurrent or self-referencing layers.
     Any model that cannot be statically proven to terminate in bounded time is rejected. This is the primary safety mechanism — see Section 45.6.
  6. Compute worst-case latency from the proven model structure (tree depth, layer count, MAC operations) and reject if it exceeds the target subsystem's latency budget.
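The first two validation steps can be illustrated with a parser over the fixed-offset header fields above. `LoadError` and `validate_header` are hypothetical names, and a real loader continues with the hash, signature, and termination-proof steps:

```rust
const ISLE_MAGIC: u32 = 0x49534C45; // "ISLE"
const MAX_PARAM_SIZE: u64 = 1 << 20; // default 1 MB cap, configurable

#[derive(Debug, PartialEq)]
pub enum LoadError {
    BadMagic,
    TooLarge,
    Truncated,
}

/// Steps 1-2 of load-time validation: check the magic bytes and cap
/// param_size. Field offsets follow the header layout above: five u32
/// fields (magic through outputs) occupy bytes 0..20, then param_size
/// (u64) at 20..28 and max_latency_ns (u64) at 28..36.
pub fn validate_header(file: &[u8]) -> Result<u64, LoadError> {
    if file.len() < 36 {
        return Err(LoadError::Truncated);
    }
    let magic = u32::from_le_bytes(file[0..4].try_into().unwrap());
    if magic != ISLE_MAGIC {
        return Err(LoadError::BadMagic);
    }
    let param_size = u64::from_le_bytes(file[20..28].try_into().unwrap());
    if param_size > MAX_PARAM_SIZE {
        return Err(LoadError::TooLarge);
    }
    Ok(param_size)
}

fn main() {
    let mut hdr = vec![0u8; 36];
    hdr[0..4].copy_from_slice(&0x49534C45u32.to_le_bytes());
    hdr[20..28].copy_from_slice(&1024u64.to_le_bytes());
    assert_eq!(validate_header(&hdr), Ok(1024));
    hdr[0] = 0;
    assert_eq!(validate_header(&hdr), Err(LoadError::BadMagic));
    assert_eq!(validate_header(&[0u8; 10]), Err(LoadError::Truncated));
}
```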

Atomic replacement: Models are replaced via double-buffer. The new model is loaded into a shadow slot while the current model continues serving. Once the new model is fully loaded and validated, an atomic pointer swap activates it. The old model is freed after a grace period (RCU-style) to ensure no in-flight inference uses stale data.
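The double-buffer swap can be sketched with a single atomic pointer. `ModelSlot` is an illustrative userspace analogue; the RCU grace-period free is elided, so the old model is deliberately leaked here rather than freed:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Sketch of atomic model replacement: readers snapshot the current
/// pointer without locking; an update publishes a fully validated
/// replacement with one atomic swap.
pub struct ModelSlot {
    current: AtomicPtr<Vec<i8>>, // quantized weights, illustrative payload
}

impl ModelSlot {
    pub fn new(weights: Vec<i8>) -> Self {
        Self { current: AtomicPtr::new(Box::into_raw(Box::new(weights))) }
    }

    /// Fast path: take a snapshot of the active model.
    pub fn read(&self) -> &Vec<i8> {
        unsafe { &*self.current.load(Ordering::Acquire) }
    }

    /// Slow path: publish a new, already-validated model atomically.
    pub fn replace(&self, new_weights: Vec<i8>) {
        let new_ptr = Box::into_raw(Box::new(new_weights));
        let _old = self.current.swap(new_ptr, Ordering::AcqRel);
        // In the kernel, `_old` is freed only after a grace period, once
        // no in-flight inference can still hold it. Leaked in this sketch.
    }
}

fn main() {
    let slot = ModelSlot::new(vec![1, 2, 3]);
    assert_eq!(slot.read().len(), 3);
    slot.replace(vec![4, 5]); // shadow slot loaded, then pointer swapped
    assert_eq!(slot.read(), &vec![4, 5]);
}
```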

45.8 Model Drift Detection and Retraining Pipeline

Models trained on one workload profile may degrade when hardware changes (new storage devices with different latency characteristics), workload shifts (database server repurposed as a build server), or system configuration changes (memory added, CPUs hotplugged). This section addresses how ISLE detects and responds to model drift.

Online accuracy tracking — every model maintains a correct_predictions counter (Section 45.4 ModelStats). For the page prefetcher, a "correct prediction" means the prefetched page was actually accessed within a configurable window (~100ms). For the I/O scheduler, it means the predicted optimal ordering actually reduced tail latency. The kernel continuously computes a rolling accuracy rate:

accuracy = correct_predictions / total_invocations (over last 60 seconds)

Drift detection thresholds — the model is automatically disabled (falling back to heuristic) when accuracy drops below a configurable threshold:

/sys/kernel/isle/inference/models/page_prefetch/
    accuracy              # Current rolling accuracy (read-only)
    accuracy_threshold    # Minimum acceptable accuracy (read/write, default: 0.60)
    drift_detected        # "1" if accuracy < threshold (read-only)
    auto_disable          # "1" to auto-disable on drift (read/write, default: "1")

When drift_detected transitions to 1:

  1. Model is disabled, heuristic fallback activates immediately
  2. A kernel event is emitted: isle_inference_drift(model="page_prefetch", accuracy=0.52)
  3. The event is visible via stable tracepoints (Section 40) and sysfs
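The drift check itself is integer arithmetic over the ModelStats counters. This sketch uses per-mille fixed point to mirror the kernel's no-floating-point constraint; the field and function names are illustrative:

```rust
/// Minimal stand-in for the counters from Section 45.4 ModelStats,
/// accumulated over the 60-second rolling window.
pub struct ModelStats {
    pub correct_predictions: u64,
    pub total_invocations: u64,
}

/// Rolling accuracy in per-mille (0..=1000), avoiding floating point.
pub fn accuracy_permille(s: &ModelStats) -> u64 {
    if s.total_invocations == 0 {
        return 1000; // no data yet: assume healthy
    }
    s.correct_predictions * 1000 / s.total_invocations
}

/// True when the model should fall back (accuracy below threshold,
/// e.g. 600 per-mille for the default 0.60 accuracy_threshold).
pub fn drift_detected(s: &ModelStats, threshold_permille: u64) -> bool {
    accuracy_permille(s) < threshold_permille
}

fn main() {
    let degraded = ModelStats { correct_predictions: 52, total_invocations: 100 };
    assert_eq!(accuracy_permille(&degraded), 520);
    assert!(drift_detected(&degraded, 600)); // 0.52 < 0.60: fall back

    let healthy = ModelStats { correct_predictions: 70, total_invocations: 100 };
    assert!(!drift_detected(&healthy, 600));
}
```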

Retraining pipeline — model retraining is deliberately out-of-kernel. The kernel collects training data; userspace trains models; the kernel loads the result:

┌──────────────┐    trace data   ┌───────────────┐   model.bin   ┌──────────────┐
│ ISLE Kernel  │ ──────────────→ │ Userspace     │ ────────────→ │ ISLE Kernel  │
│ (inference)  │  sysfs/tracefs  │ Trainer       │  sysfs write  │ (inference)  │
│              │                 │ (isle-mltool) │               │              │
│ Emits:       │                 │ - Reads trace │               │ Atomic swap: │
│ - features   │                 │ - Trains model│               │ new model    │
│ - outcomes   │                 │ - Quantizes   │               │ replaces old │
│ - accuracy   │                 │ - Validates   │               │              │
└──────────────┘                 └───────────────┘               └──────────────┘

The isle-mltool userspace utility (shipped with ISLE) automates the pipeline:

  1. Reads training features from tracefs ring buffer
  2. Trains a decision tree or lookup table using the collected features + outcomes
  3. Quantizes to int8/int16
  4. Validates against a holdout set
  5. Writes the new model to sysfs (atomic replacement)

This can run as a cron job, a systemd timer, or triggered by the drift detection event. The kernel never trains a model itself — training requires floating-point, unbounded memory, and access to training libraries, all of which belong in userspace.

Why not online learning in-kernel? — Online learning (updating model weights incrementally as new data arrives) would eliminate the retraining round-trip. However:

  - Online learning requires floating-point arithmetic (kernel uses integer-only inference)
  - Convergence guarantees are weaker (adversarial inputs could steer the model — see Section 45.6a)
  - Deterministic execution is impossible to guarantee with continuous weight updates
  - The kernel's cycle budget (~500-5000 cycles per inference) leaves no room for gradient computation

The offline-train / online-infer split is deliberate: the kernel is a fast, dumb inference engine; intelligence lives in userspace where it can be debugged, validated, and rolled back.

45.9 Tier 2 Inference Services

Section 45 defines in-kernel inference: tiny integer-only models running in the kernel's cycle budget (~500-5000 cycles) for per-decision hot paths. But many kernel optimization decisions don't need per-I/O latency — they're slow-path, strategic decisions made every few seconds or minutes. These can benefit from much more powerful models running in Tier 2 userspace drivers.

The architectural fit — Tier 2 drivers (Section 3) run as isolated userspace processes communicating with isle-core via ring buffers. They can:

  - Use floating-point and SIMD (no kernel FP restrictions)
  - Allocate unlimited heap memory
  - Link against full ML frameworks (ONNX Runtime, XGBoost, scikit-learn, PyTorch)
  - Access GPU/NPU accelerators via isle-accel (Section 42)
  - Crash without affecting the kernel (full process isolation)

This makes Tier 2 the natural home for advanced AI capabilities that exceed what the in-kernel inference engine can provide.

Tier 2 inference service model:

┌─────────────────────────────────────────────────────────┐
│ ISLE Core (Ring 0)                                      │
│                                                         │
│  In-kernel inference (45):       Tier 2 IPC client:     │
│  - Decision trees (fast)         - Ring buffer query →  │
│  - Lookup tables (fast)          - ← Ring buffer reply  │
│  - ~500-5000 cycles              - ~50-200 μs round-trip│
│  - Hot path (per-I/O)            - Warm path (periodic) │
│                                                         │
└──────────────────────┬──────────────────────────────────┘
                       │ Ring buffer IPC
                       │ (shared memory, zero-copy)
┌──────────────────────▼──────────────────────────────────┐
│ Tier 2 Inference Service (Userspace process)            │
│                                                         │
│  - Full FP, GPU access, unlimited memory                │
│  - GBM / Random Forest / small transformer models       │
│  - Online learning (incremental weight updates)         │
│  - Training data ingest from kernel ring buffer         │
│  - Model export back to in-kernel engine                │
│                                                         │
└─────────────────────────────────────────────────────────┘

Use cases for Tier 2 inference services:

| Decision | Frequency | In-kernel (45) | Tier 2 service |
|----------|-----------|----------------|----------------|
| Page prefetch (which pages next?) | Per fault (~μs) | Decision tree | N/A — too latency-sensitive |
| I/O scheduling (reorder queue) | Per I/O (~μs) | Lookup table | N/A — too latency-sensitive |
| NUMA rebalancing (move pages?) | Every ~10s | Too complex | GBM model: predict migration benefit from memory access histograms |
| Compression algorithm selection | Per cgroup, every ~30s | N/A | Random forest: select LZ4/Zstd/none based on cgroup's page entropy distribution |
| Swap tier selection (Section 47) | Every ~5s | N/A | Regression model: predict optimal local-vs-remote swap ratio from RDMA latency measurements |
| Anomaly detection (FMA, Section 39) | Every ~1s | N/A | Autoencoder or isolation forest: detect anomalous device behavior patterns |
| Power budget optimization (Section 49) | Every ~1s | N/A | RL agent: learn optimal RAPL power limits given current workload mix |
| Driver crash prediction | Every ~5s | N/A | Time-series model: predict imminent driver failure from error rate trends |

Online learning — safely in Tier 2:

The in-kernel inference engine cannot do online learning (Section 45.8 explains why: no floating-point, no convergence guarantees, adversarial risk). But a Tier 2 service can, because:

  1. Full FP available: Tier 2 runs in userspace with standard libm, SSE/AVX, GPU
  2. Crash-safe: if the online learning algorithm diverges or crashes, the Tier 2 process restarts. The kernel falls back to its in-kernel heuristic. No kernel impact.
  3. Adversarial resistance: the Tier 2 service can implement proper input validation, outlier detection, and gradient clipping — techniques too expensive for in-kernel but standard practice in ML engineering
  4. Auditable: the Tier 2 service's model weights, training data, and decisions can be logged, inspected, and rolled back using standard userspace debugging tools

The online learning loop:

1. Kernel emits training features to Tier 2 service via ring buffer
   (page fault patterns, I/O latencies, scheduling decisions + outcomes)
2. Tier 2 service updates model incrementally (mini-batch SGD, online RF, etc.)
3. Periodically, Tier 2 service quantizes a snapshot of its model to int8/int16
4. Tier 2 service writes the quantized model to the kernel via sysfs
   (atomic model replacement — Section 45.4)
5. In-kernel inference engine picks up the new model instantly
6. Kernel continues fast, deterministic, integer-only inference with fresh weights

This gives the system continuous adaptation — the in-kernel model is refreshed every few minutes with weights learned from the actual running workload — without any of the risks of in-kernel online learning. The Tier 2 service absorbs all the complexity (floating-point, convergence, validation), and the kernel sees only a stream of pre-validated, quantized model snapshots.
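Step 3 of the loop (quantizing a float snapshot to int8) might look like the following symmetric-scale sketch. `quantize_i8` is an illustrative helper under an assumed symmetric quantization scheme, not isle-mltool's actual code:

```rust
/// Symmetric int8 quantization of a float weight snapshot: map the
/// largest-magnitude weight to ±127 and scale everything else linearly.
/// Returns the quantized weights plus the scale needed to dequantize
/// (w ≈ q * scale), which the kernel applies in fixed point.
pub fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return (vec![0; weights.len()], 1.0); // all-zero snapshot
    }
    let q = weights
        .iter()
        .map(|w| (w * 127.0 / max_abs).round() as i8) // |w| <= max_abs, so result fits i8
        .collect();
    (q, max_abs / 127.0)
}

fn main() {
    let (q, scale) = quantize_i8(&[0.5, -1.0, 0.25]);
    assert_eq!(q, vec![64, -127, 32]);
    assert!((scale - 1.0 / 127.0).abs() < 1e-6);

    let (z, _) = quantize_i8(&[0.0, 0.0]);
    assert_eq!(z, vec![0, 0]);
}
```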

Tier 2 inference service KABI:

/// KABI interface for Tier 2 inference services.
/// Registered via the standard Tier 2 driver mechanism.
#[repr(C)]
pub struct InferenceServiceVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Kernel sends a query to the inference service.
    /// Returns immediately — result arrives asynchronously via ring buffer.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer and
    /// `features` points to at least `feature_count` valid i32 values.
    pub submit_query: unsafe extern "C" fn(
        service: *mut ServiceContext,
        query_type: InferenceQueryType,
        features: *const i32,
        feature_count: u32,
    ) -> KabiResult,

    /// Kernel retrieves the latest model snapshot (int8/int16 quantized).
    /// Called periodically to update the in-kernel inference engine.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer and
    /// `buffer` points to at least `buffer_len` bytes of writable memory.
    pub get_model_snapshot: unsafe extern "C" fn(
        service: *mut ServiceContext,
        model_type: KernelModelType,
        buffer: *mut u8,
        buffer_len: usize,
    ) -> KabiResult,

    /// Kernel reports ground truth (outcome of a previous prediction).
    /// Enables online accuracy tracking in the Tier 2 service.
    ///
    /// # Safety
    /// Caller must ensure `service` is a valid context pointer.
    pub report_outcome: unsafe extern "C" fn(
        service: *mut ServiceContext,
        query_id: u64,
        outcome: i32,
    ) -> KabiResult,
}

#[repr(u32)]
pub enum InferenceQueryType {
    /// Should we rebalance NUMA memory for this cgroup?
    NumaRebalance = 0,
    /// Which compression algorithm for this cgroup?
    CompressionSelect = 1,
    /// Optimal swap tier ratio (local vs remote)?
    SwapTierSelect = 2,
    /// Anomaly score for this device's health metrics?
    DeviceAnomaly = 3,
    /// Optimal power budget for current workload mix?
    PowerBudget = 4,
}

Latency budget — Tier 2 IPC round-trip is ~50-200μs (ring buffer submission + context switch + userspace inference + ring buffer reply). This is acceptable for decisions made every 1-30 seconds. For any decision needed per-I/O or per-fault, the in-kernel inference engine (Section 45) remains the only option.

Fallback hierarchy:

Tier 2 inference service available?
├── Yes → Use Tier 2 for slow-path decisions, in-kernel for hot-path
│         Tier 2 periodically refreshes in-kernel model weights
│
└── No  → In-kernel inference engine with static model
          ├── Model loaded? → Use model
          └── No model?     → Heuristic fallback (LRU, FIFO, static thresholds)

The system is fully functional at every level of the fallback hierarchy. Tier 2 inference services are a performance optimization, not a requirement. A system with no ML models at all behaves identically to a traditional kernel using standard heuristics.

Shipped Tier 2 services — ISLE ships the following reference Tier 2 inference services (optional, loaded on demand):

| Service | Model type | Purpose |
|---------|-----------|---------|
| isle-ml-numa | Gradient-boosted trees | NUMA page placement and migration decisions |
| isle-ml-compress | Random forest | Per-cgroup compression algorithm selection |
| isle-ml-anomaly | Isolation forest | FMA device anomaly detection |
| isle-ml-power | Online RL (contextual bandit) | RAPL power budget optimization |

Each is a standalone Tier 2 driver (~500-2000 LOC) that implements the InferenceServiceVTable. They are independently upgradable, independently crashable (Tier 2 process restart in ~10ms), and independently optional.


46. Accelerator Networking, RDMA, and Linux GPU Compatibility

46.1 RDMA and Collective Operations

46.1.1 Problem

Distributed ML training on N GPUs across M machines requires:

  • All-Reduce: After computing gradients on each GPU, average them across all GPUs. This is the single most important collective operation in distributed training.
  • All-Gather: Assemble distributed tensor shards.
  • Reduce-Scatter: Reduce and distribute results.
  • Point-to-point: Direct GPU-to-GPU transfer across network.

These operations need RDMA (Remote Direct Memory Access) for low latency and high throughput. Linux provides RDMA through the rdma-core / libibverbs userspace stack, but the kernel's role is minimal (IB/verbs kernel interface is thin).

46.1.2 Design: Kernel-Assisted Collectives

The kernel can optimize collectives by understanding the topology:

8-GPU training cluster:
  Machine A: GPU 0, GPU 1 (NVLink connected)
  Machine B: GPU 2, GPU 3 (NVLink connected)
  Machine C: GPU 4, GPU 5 (NVLink connected)
  Machine D: GPU 6, GPU 7 (NVLink connected)
  All machines connected via 100GbE RDMA

Optimal all-reduce strategy:
  1. Intra-machine: GPU 0 ↔ GPU 1 via NVLink (200 GB/s)
  2. Inter-machine: GPU 0 ↔ GPU 2 ↔ GPU 4 ↔ GPU 6 via RDMA (12.5 GB/s)
  3. Intra-machine: Broadcast result GPU 0 → GPU 1, etc.

The kernel knows the topology (device registry + network fabric).
Userspace NCCL/RCCL libraries currently discover this themselves.
The kernel can provide it as a service.
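A back-of-envelope cost model shows why the hierarchical schedule beats a flat 8-node ring over RDMA. The bandwidth figures come from the example topology above; the model counts only bytes moved and ignores per-hop latency, and `ring_allreduce_secs` is a hypothetical helper:

```rust
/// Time for a bandwidth-optimal ring all-reduce over `nodes` peers:
/// each node transfers 2*(N-1)/N * S bytes at the given link bandwidth.
pub fn ring_allreduce_secs(nodes: u64, bytes: u64, gbps: f64) -> f64 {
    let traffic = 2.0 * (nodes - 1) as f64 / nodes as f64 * bytes as f64;
    traffic / (gbps * 1e9)
}

fn main() {
    let grad = 1u64 << 30; // 1 GiB of gradients per GPU

    // Flat schedule: all 8 GPUs in one ring over 12.5 GB/s RDMA.
    let flat = ring_allreduce_secs(8, grad, 12.5);

    // Hierarchical schedule: reduce over 200 GB/s NVLink, 4-machine
    // ring over RDMA, then broadcast back over NVLink.
    let hier = grad as f64 / 200e9               // intra-machine reduce
        + ring_allreduce_secs(4, grad, 12.5)     // inter-machine ring
        + grad as f64 / 200e9;                   // intra-machine broadcast

    assert!(hier < flat); // topology awareness pays off
    assert!(flat > 0.15 && flat < 0.16); // ~150 ms for the flat ring
}
```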

46.1.3 RDMA KABI Interface

/// RDMA device vtable (extends AccelBase for RDMA-capable NICs).
#[repr(C)]
pub struct RdmaDeviceVTable {
    pub vtable_size: u64,
    pub version: u32,

    /// Create a protection domain (PD).
    pub create_pd: unsafe extern "C" fn(
        ctx: *mut c_void,
        out_pd: *mut RdmaPdHandle,
    ) -> IoResultCode,

    /// Register a memory region for RDMA.
    /// The registered region can be targeted by remote DMA.
    pub register_mr: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        addr: u64,         // Virtual or physical address
        size: u64,
        access: RdmaAccessFlags,
        out_mr: *mut RdmaMrHandle,
        out_rkey: *mut u32,
    ) -> IoResultCode,

    /// Register device memory (GPU VRAM) for RDMA.
    /// Enables GPUDirect RDMA: remote machine can DMA directly to/from GPU.
    pub register_device_mr: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        accel_mem: AccelMemHandle,
        offset: u64,
        size: u64,
        access: RdmaAccessFlags,
        out_mr: *mut RdmaMrHandle,
        out_rkey: *mut u32,
    ) -> IoResultCode>,

    /// Post a send work request.
    pub post_send: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        wr: *const RdmaSendWr,
    ) -> IoResultCode,

    /// Post a receive work request.
    pub post_recv: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        wr: *const RdmaRecvWr,
    ) -> IoResultCode,

    /// Poll completion queue.
    pub poll_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        cq: RdmaCqHandle,
        max_entries: u32,
        out_wc: *mut RdmaWorkCompletion,
        out_count: *mut u32,
    ) -> IoResultCode,

    /// Create a queue pair (QP) for RDMA communication.
    pub create_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
        qp_type: RdmaQpType,
        send_cq: RdmaCqHandle,
        recv_cq: RdmaCqHandle,
        out_qp: *mut RdmaQpHandle,
    ) -> IoResultCode,

    /// Destroy a queue pair.
    pub destroy_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
    ) -> IoResultCode,

    /// Create a completion queue (CQ).
    pub create_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        min_entries: u32,
        out_cq: *mut RdmaCqHandle,
    ) -> IoResultCode,

    /// Destroy a completion queue.
    pub destroy_cq: unsafe extern "C" fn(
        ctx: *mut c_void,
        cq: RdmaCqHandle,
    ) -> IoResultCode,

    /// Deregister a memory region.
    pub deregister_mr: unsafe extern "C" fn(
        ctx: *mut c_void,
        mr: RdmaMrHandle,
    ) -> IoResultCode,

    /// Destroy a protection domain.
    pub destroy_pd: unsafe extern "C" fn(
        ctx: *mut c_void,
        pd: RdmaPdHandle,
    ) -> IoResultCode,

    // === Connection Management (RC queue pairs) ===

    /// Resolve a route to a remote RDMA destination and initiate connection.
    /// Transitions the QP from INIT → RTR → RTS (for RC transport).
    /// `dest` contains the remote node's LID/GID, QP number, and path info.
    pub connect_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        dest: *const RdmaConnParams,
    ) -> IoResultCode,

    /// Bind a QP to a local port and listen for incoming RC connections.
    /// The QP must be in the INIT state. When a remote peer calls
    /// `connect_qp` targeting this endpoint, the kernel invokes the
    /// `conn_request_callback` registered via `set_conn_callback`.
    pub listen_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        port: u8,
    ) -> IoResultCode,

    /// Accept an incoming RC connection request on a listening QP.
    /// Transitions the QP to RTS. `request_id` is the identifier
    /// provided by the `conn_request_callback`.
    pub accept_conn: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        request_id: u64,
        resp_params: *const RdmaConnParams,
    ) -> IoResultCode,

    /// Gracefully disconnect an RC queue pair. Sends a DREQ to the
    /// remote peer and transitions the QP to the SQD (Send Queue Drained)
    /// and then ERROR state. Outstanding work requests are flushed with
    /// error completions. After disconnect, the QP can be destroyed.
    pub disconnect_qp: unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
    ) -> IoResultCode,

    /// Register a callback for incoming connection requests on a listening QP.
    /// The callback receives a `request_id` that can be passed to `accept_conn`
    /// or ignored (to reject the connection).
    pub set_conn_callback: Option<unsafe extern "C" fn(
        ctx: *mut c_void,
        qp: RdmaQpHandle,
        callback: unsafe extern "C" fn(
            qp: RdmaQpHandle,
            request_id: u64,
            remote_params: *const RdmaConnParams,
        ),
    ) -> IoResultCode>,
}

/// Connection parameters for RC queue pair connection management.
#[repr(C)]
pub struct RdmaConnParams {
    /// Remote QP number (24 bits, upper 8 bits reserved).
    pub remote_qpn: u32,
    /// Remote LID (local identifier, for InfiniBand subnets).
    pub remote_lid: u16,
    /// Service level (QoS, 0-15).
    pub service_level: u8,
    /// Global routing: remote GID (for RoCE and cross-subnet IB).
    pub remote_gid: [u8; 16],
    /// Explicit padding: `remote_gid` ends at offset 23; `path_mtu` (u32)
    /// needs 4-byte alignment, so 1 byte of padding aligns it to offset 24.
    pub _pad1: u8,
    /// Path MTU (256, 512, 1024, 2048, 4096 bytes).
    pub path_mtu: u32,
    /// Retry count for RC transport (0-7).
    pub retry_count: u8,
    /// RNR (Receiver Not Ready) retry count (0-7).
    pub rnr_retry: u8,
    /// Private data for application-level connection negotiation (up to 56 bytes).
    pub private_data: [u8; 56],
    /// Length of valid private data bytes.
    pub private_data_len: u8,
    /// Explicit tail padding: fields end at offset 87; five bytes round
    /// the struct to 92 bytes, a multiple of its 4-byte alignment (with
    /// [u8; 3] the compiler would insert 2 implicit trailing bytes).
    pub _pad: [u8; 5],
}

46.1.4 Topology Export

The kernel exports accelerator and network topology so that collective libraries (NCCL, RCCL, Gloo, etc.) can make optimal routing decisions:

/sys/kernel/isle/topology/
    accelerators        # List of all accelerators with NUMA/PCIe location
    p2p_matrix          # NxN matrix of P2P bandwidth between accelerators
    rdma_links          # RDMA network links with bandwidth/latency
    collective_groups   # Pre-computed optimal collective groups

$ cat /sys/kernel/isle/topology/p2p_matrix
# Bandwidth in GB/s (0 = no direct path)
#       GPU0    GPU1    GPU2    GPU3
GPU0    -       200     12.5    12.5
GPU1    200     -       12.5    12.5
GPU2    12.5    12.5    -       200
GPU3    12.5    12.5    200     -

This is additive — NCCL currently discovers topology through a combination of nvidia-smi topo, sysfs traversal, and trial-and-error. A kernel-provided topology file is faster and more reliable.
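A collective library consuming p2p_matrix might parse it like this sketch; `parse_p2p_matrix` is a hypothetical helper, and the file format is assumed to match the example above (comment lines, one labeled row per device, "-" on the diagonal):

```rust
/// Parse the p2p_matrix text into a bandwidth lookup (GB/s).
/// "-" marks the diagonal (self), and 0 means no direct P2P path.
pub fn parse_p2p_matrix(text: &str) -> Vec<Vec<f64>> {
    text.lines()
        .filter(|l| !l.trim_start().starts_with('#') && !l.trim().is_empty())
        .map(|l| {
            l.split_whitespace()
                .skip(1) // drop the row label (e.g. "GPU0")
                .map(|cell| if cell == "-" { 0.0 } else { cell.parse().unwrap_or(0.0) })
                .collect()
        })
        .collect()
}

fn main() {
    let sample = "\
# Bandwidth in GB/s (0 = no direct path)
#       GPU0    GPU1
GPU0    -       200
GPU1    200     -
";
    let m = parse_p2p_matrix(sample);
    assert_eq!(m.len(), 2);
    assert_eq!(m[0][1], 200.0); // NVLink pair
    assert_eq!(m[1][1], 0.0);   // diagonal
}
```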


46.2 Linux Compatibility Layer

46.2.1 DRM/KMS Compatibility

Linux's Direct Rendering Manager (DRM) is the standard GPU kernel interface. Userspace graphics stacks (Mesa, Vulkan, Wayland compositors) use it via /dev/dri/card* and /dev/dri/renderD*.

ISLE provides DRM compatibility through isle-compat:

Userspace (Vulkan, OpenGL, CUDA, etc.)
    |
    | Standard DRM/KMS ioctls
    v
isle-compat/src/drm/
    |
    | Translates DRM ioctls to isle-accel KABI calls
    v
isle-accel scheduler + AccelBase/AccelCompute vtable
    |
    | KABI vtable calls
    v
GPU driver (Tier 1, domain-isolated)

DRM ioctls translated to isle-accel operations:

| DRM ioctl | isle-accel equivalent |
|-----------|----------------------|
| DRM_IOCTL_MODE_* | Display subsystem (AccelDisplayVTable, not covered here) |
| DRM_IOCTL_*_CTX_CREATE/DESTROY | create_context / destroy_context |
| DRM_IOCTL_*_GEM_CREATE | alloc_device_memory |
| DRM_IOCTL_*_GEM_MMAP | map_device_memory |
| DRM_IOCTL_*_EXECBUFFER | submit_commands (via AccelScheduler) |
| DRM_IOCTL_*_WAIT | poll_completion |
| DRM_IOCTL_PRIME_* | P2P DMA handle export/import |

46.2.2 NVIDIA Compatibility

NVIDIA's userspace stack (CUDA runtime, cuDNN, TensorRT) communicates with the kernel driver through proprietary ioctls on /dev/nvidia*. The compatibility approach:

Option A: NVIDIA ships a KABI-native kernel interface layer
  - NVIDIA already has a clean internal split between their proprietary
    compute core and the "kernel interface layer" (nvidia.ko)
  - ISLE provides a KABI implementation of this interface layer
  - NVIDIA's compute core links against our KABI implementation
  - Same approach described in Section 55.4

Option B: ioctl compatibility shim
  - Translate NVIDIA's /dev/nvidia* ioctls to isle-accel KABI calls
  - More fragile (NVIDIA changes their ioctl interface between releases)
  - Stopgap until Option A is available

Option A is strongly preferred and is the medium-term plan.

46.2.3 Feasibility Analysis: Porting Open-Source NVIDIA Drivers

NVIDIA open-sourced their kernel modules (nvidia.ko, nvidia-uvm.ko, nvidia-modeset.ko) in 2022 under dual MIT/GPLv2 license. This section analyzes the feasibility of writing a new ISLE-native driver based on this open-source code, preserving binary compatibility with NVIDIA's proprietary userspace stack (CUDA, cuDNN, TensorRT, etc.).

Why this matters: If unmodified libcuda.so, libnvidia-ml.so, cuDNN, TensorRT, and the entire CUDA toolkit work on ISLE without recompilation, the kernel is immediately viable for the entire GPU computing ecosystem.

NVIDIA's Architecture (Post-2022 Open-Source Release)

The modern NVIDIA driver stack has a clean three-layer architecture:

Layer 3: Proprietary Userspace (binary-only, NOT recompiled)
  ┌─────────────────────────────────────────────────────┐
  │  libcuda.so (CUDA runtime)                          │
  │  libnvidia-ml.so (NVML monitoring)                  │
  │  cuDNN, TensorRT, NCCL                              │
  │  Vulkan/OpenGL ICD (libnvidia-glcore.so)            │
  │  NVENC/NVDEC (video encode/decode)                  │
  └──────────────────────┬──────────────────────────────┘
                         │ ioctl() on /dev/nvidia*
                         │ (stable-ish interface, versioned by driver release)
                         v
Layer 2: Open-Source Kernel Module (MIT/GPLv2, THIS IS WHAT WE PORT)
  ┌─────────────────────────────────────────────────────┐
  │  nvidia.ko                                          │
  │  ├── OS interface layer (nv-linux.c, nv-pat.c,     │
  │  │   nv-mmap.c, nv-i2c.c, nv-acpi.c, etc.)        │
  │  │   → Linux kernel API calls (PCI, DMA, IRQ, MM)  │
  │  │   → THIS is what we rewrite for ISLE       │
  │  │                                                  │
  │  ├── RM (Resource Manager) core                     │
  │  │   → Hardware-agnostic resource management        │
  │  │   → Talks DOWN to GSP firmware via RM RPC        │
  │  │   → Talks UP to userspace via control ioctls     │
  │  │   → We can reuse this largely unchanged          │
  │  │                                                  │
  │  └── Entry points: module_init/exit, ioctl dispatch │
  │      → Rewrite for KABI driver lifecycle             │
  │                                                     │
  │  nvidia-uvm.ko (Unified Virtual Memory)             │
  │  ├── Page fault handler, migration engine           │
  │  ├── Uses Linux HMM (mmu_notifiers)                 │
  │  └── Integrate with isle-core HMM (Section 43)       │
  │                                                     │
  │  nvidia-modeset.ko (Display/KMS)                    │
  │  ├── Mode setting, display management               │
  │  └── Map to AccelDisplayVTable                       │
  └──────────────────────┬──────────────────────────────┘
                         │ RM RPC (register read/write commands)
                         v
Layer 1: GSP Firmware (runs ON the GPU, opaque, NOT our concern)
  ┌─────────────────────────────────────────────────────┐
  │  GPU System Processor firmware                      │
  │  - Loaded by kernel module during initialization    │
  │  - Handles actual hardware programming              │
  │  - Context switching, power management, ECC, etc.   │
  │  - Communicates with kernel via shared memory + IRQ │
  │  - Binary blob, but runs on GPU, not on CPU         │
  │  - We just need to load it and talk RM RPC to it    │
  └─────────────────────────────────────────────────────┘

Key insight: nvidia.ko is NOT a traditional monolithic driver. On modern GPUs (Turing and newer, i.e., everything from 2018+), the kernel module is primarily a thin translation layer between Linux kernel APIs and the GSP firmware running on the GPU itself. The "intelligence" is in the GSP firmware, not in the kernel module.

Component-by-Component Porting Analysis

Component 1: OS Interface Layer → KABI Translation (Mechanical, ~60% of work)

| Linux API Used | ISLE Equivalent | Difficulty |
|----------------|-----------------|------------|
| pci_register_driver() | KABI device registry match + register_driver() | Trivial |
| request_irq() / free_irq() | KABI request_irq() / free_irq() | Trivial |
| dma_alloc_coherent() / dma_map_*() | KABI dma_alloc() / dma_map() | Trivial |
| ioremap() / iounmap() | KABI ioremap() / iounmap() | Trivial |
| alloc_pages() / __free_pages() | KABI alloc_pages() | Trivial |
| mutex_lock() / spin_lock() | KABI mutex / spinlock (or Rust equivalents) | Trivial |
| timer_setup() / schedule_work() | KABI timer / workqueue equivalents | Easy |
| pci_enable_msi() / MSI-X | KABI MSI/MSI-X support | Easy |
| sysfs_create_group() | KABI property export (device registry) | Easy |
| ACPI methods (power management) | KABI power state callbacks | Easy |
| vm_area_struct / vm_operations | KABI memory mapping interface | Medium |
| mmu_notifier_*() (for UVM) | ISLE HMM PageLocationTracker (Section 43) | Medium |
| /proc/driver/nvidia/ | /sys/kernel/isle/accel/ (Section 46.2.5) | Easy |

More than 90 files in the open-source nvidia.ko implement Linux kernel API calls. The work is mechanical: each Linux API call maps to its KABI equivalent, with no algorithmic decisions to make. It is pure API translation.

Component 2: ioctl Interface → /dev/nvidia* Compatibility (Critical)

The binary userspace communicates exclusively through ioctls on:
  • /dev/nvidia0, /dev/nvidia1, ... (per-GPU control)
  • /dev/nvidiactl (system-level control)
  • /dev/nvidia-uvm (unified virtual memory)
  • /dev/nvidia-uvm-tools (profiling)
  • /dev/nvidia-modeset (display)

Approach:
  isle-compat provides /dev/nvidia* character devices.
  Each ioctl is dispatched to the ISLE NVIDIA driver (Tier 1).
  The driver's RM core processes the ioctl exactly as it does on Linux.

  The critical point: we do NOT reinterpret ioctls.
  We pass them through to the same RM core code (open-source).
  The RM core talks to GSP firmware, which does the actual work.
  The ioctl ABI is preserved byte-for-byte.

The ioctl numbers and structures are defined in NVIDIA's open-source headers (nvidia-uvm/uvm_linux_ioctl.h, nvidia/nv-ioctl-numbers.h). Since the RM core is open-source and handles ioctl dispatch internally, we preserve the exact same ioctl interface. Binary libcuda.so cannot distinguish ISLE from Linux.

Component 3: UVM (Unified Virtual Memory) → ISLE HMM Integration (Hardest Part)

nvidia-uvm.ko implements CUDA Unified Memory. On Linux, it hooks deeply into the memory management subsystem:

Linux UVM hooks:
  - mmu_notifier (track CPU page table changes)
  - hmm_range_fault() (resolve CPU page faults for GPU access)
  - migrate_vma() (migrate pages between CPU and GPU)
  - Fault handler for GPU page faults (ATS/PRI)

ISLE's advantage: Section 43 (Heterogeneous Memory Management) provides exactly these primitives as first-class kernel features. The mapping is:

| Linux UVM mechanism | ISLE equivalent |
|---------------------|-----------------|
| mmu_notifier_register() | PageLocationTracker subscription |
| hmm_range_fault() | ISLE HMM handle_device_fault() callback |
| migrate_vma_setup/pages/finalize() | AccelBase migrate_pages() |
| fault_handler() for ATS | ISLE device fault handler (Section 43.1.4) |
| Per-process GPU page tables | AccelContext memory management |
| UVM counters / access tracking | ISLE memory accounting (cgroup accel.memory.*) |

This is the most complex component because UVM is deeply interwoven with memory-management internals. However, ISLE's HMM design (Section 43) was built with exactly this use case in mind, so the mapping is cleaner than on Linux.

Component 4: GSP Firmware Loading and Communication (Straightforward)

The kernel module loads GSP firmware from the filesystem (/lib/firmware/nvidia/) into GPU VRAM, then communicates via shared memory regions and interrupts. This is self-contained within the RM core and does not depend on Linux APIs beyond basic DMA and interrupt handling — both of which map trivially to KABI.

Component 5: Display / KMS (nvidia-modeset.ko) (Standard)

Maps to AccelDisplayVTable. Standard DRM/KMS translation (Section 46.2.1) handles most of this. NVIDIA's modeset module is relatively thin compared to the compute path.

Difficulty Rating Summary

| Component | Difficulty | Notes |
|-----------|------------|-------|
| OS interface layer rewrite | 3/10 | Mechanical API translation |
| ioctl passthrough (/dev/nvidia*) | 2/10 | Character device + dispatch |
| GSP firmware loading | 2/10 | Self-contained in RM core |
| PCI/DMA/IRQ setup | 2/10 | Trivial KABI mapping |
| UVM → ISLE HMM integration | 7/10 | Deepest integration point |
| Display / modeset | 4/10 | Standard DRM/KMS path |
| Power management (ACPI/PCIe) | 3/10 | Device registry PM callbacks |
| Testing + binary userspace validation | 5/10 | Must test full CUDA stack |

Binary Userspace Compatibility Verification

The following must work without recompilation on ISLE:

| Component | Interface Used | Compatibility Path |
|-----------|----------------|--------------------|
| libcuda.so (CUDA runtime) | /dev/nvidia* ioctls | ioctl passthrough to RM core |
| libnvidia-ml.so (NVML) | /dev/nvidiactl ioctls + sysfs | ioctl passthrough + sysfs compat |
| cuDNN / TensorRT | Links against libcuda.so | Transitive (libcuda works → these work) |
| nvidia-smi | NVML library | Transitive |
| NCCL (multi-GPU) | libcuda + /dev/nvidia-uvm | UVM compat + P2P DMA support |
| Vulkan ICD | DRM + /dev/nvidia* | DRM compat (Section 46.2.1) + ioctl passthrough |
| NVENC/NVDEC | /dev/nvidia* ioctls | ioctl passthrough |
| Container runtime (nvidia-container-toolkit) | cgroup + device files | cgroup compat + device file compat |

What ISLE Does Better Than Linux for NVIDIA GPUs

| Capability | Linux Behavior | ISLE Behavior |
|------------|----------------|---------------|
| GPU crash recovery | System reboot required | Driver reload in ~100-500ms (Section 42.3.2) |
| GPU scheduling | Driver-internal, invisible | Kernel-managed, cgroup-integrated (Section 42.2.4) |
| GPU memory limits | None (driver tracks, no enforcement) | cgroup accel.memory.max (Section 44.2) |
| GPU compute QoS | None | cgroup accel.compute.guarantee (Section 44.4) |
| GPU memory in OOM killer | Invisible | Full visibility, OOM-killable (Section 44.3) |
| Multi-tenant isolation | MIG only (hardware-dependent) | Software scheduling + MIG (Section 44) |
| GPU observability | nvidia-smi (polling) | Stable tracepoints + eBPF (Section 42.3.4) |
| UVM performance | Bolted-on HMM, driver-specific | First-class HMM, kernel-managed (Section 43) |
| P2P DMA (GPUDirect) | NVIDIA-specific API | Generalized KABI P2P (Section 43) |
| Power management | Driver-internal | Topology-driven, device-registry-integrated |

Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| NVIDIA changes ioctl ABI between releases | Medium | High | Pin to a specific driver release initially; NVIDIA's ioctl ABI is versioned and backwards-compatible in practice |
| UVM integration bugs cause data corruption | Low | Critical | Extensive testing with CUDA memory stress tests; UVM has well-defined semantics |
| GSP firmware version incompatibility | Low | High | Support specific firmware versions matching the open-source driver release |
| Performance regression vs Linux | Medium | Medium | Profile early; most performance lives in GSP firmware and userspace, not the kernel module |
| NVIDIA open-source license compliance | None | N/A | MIT/GPLv2 dual license; our OKLF is compatible with both |

Implementation Strategy

Phase 1: Basic GPU Initialization
  - Port OS interface layer to KABI
  - GSP firmware loading
  - /dev/nvidia* character devices with ioctl passthrough
  - PCI, DMA, IRQ setup via KABI
  - Goal: nvidia-smi shows GPU info

Phase 2: Compute Workloads
  - Full RM core integration
  - CUDA simple programs work (vector add, matmul)
  - Basic context management through AccelBase
  - Goal: CUDA samples compile and run

Phase 3: UVM and Multi-GPU
  - UVM integration with ISLE HMM
  - Unified memory (cudaMallocManaged) works
  - P2P DMA for multi-GPU (NVLink, PCIe)
  - NCCL multi-GPU collectives
  - Goal: PyTorch distributed training works

Phase 4: Production Hardening
  - Display / modeset (AccelDisplayVTable)
  - Full cgroup integration
  - Crash recovery testing
  - Performance parity validation
  - Goal: Production-ready for datacenter + desktop

Licensing Note

NVIDIA's open-source kernel modules are dual-licensed MIT/GPLv2. Since we are writing a new driver inspired by the open-source code (not copy-pasting it), and our KABI interface layer is original work, there are no licensing concerns. The ioctl numbers and structures are functional interfaces (facts, not copyrightable expression). The GSP firmware is a binary blob loaded onto the GPU (not linked into our kernel). This is fully compatible with OKLF v1.3.

46.2.4 VFIO Passthrough

For VMs that need direct device access. VFIO is a general-purpose mechanism (see Section 7.3.8) — it works identically for GPUs, NICs, NVMe controllers, and any other PCIe device. The GPU-specific example:

/dev/vfio/ interface (Section 6, Tier 2 driver path)
    |
    v
isle-kvm (KABI Tier 1 driver)
    |
    | Programs IOMMU for VM isolation
    v
PCIe device (entire device assigned to VM, VM's guest driver manages it)

VFIO passthrough works unchanged from Linux. The device is assigned to the VM at the IOMMU group level (Section 7.3.8).

46.2.5 ISLE-Specific Interfaces (Superset)

Beyond Linux compatibility, new interfaces for ISLE-aware software:

/dev/isle-accel-0           # ISLE-native accelerator access
/dev/isle-accel-1           # (one per accelerator device)

/sys/kernel/isle/accel/
    devices/
        0/
            info            # AccelDeviceInfo (JSON-formatted)
            utilization     # Current utilization stats
            contexts        # Active context list
            memory          # Memory usage breakdown
            power           # Power/thermal/clock state
            topology        # PCIe/NVLink/xGMI connections
            partitions/     # MIG/SR-IOV partitions (if available)
                0/
                    info
                    bind    # Bind to cgroup
    scheduler/
        policy              # Global scheduling policy
        stats               # Global scheduling statistics
    topology/
        p2p_matrix          # Bandwidth matrix
        rdma_links          # Network links
        numa_map            # Accelerator-to-NUMA mapping
    inference/
        models/             # In-kernel model management (Section 45)

Existing Linux tools (nvidia-smi, rocm-smi, intel_gpu_top) continue to work through the DRM/sysfs compatibility layer. ISLE-specific tools (isle-accel-top, isle-gpu-smi) can use the richer /sys/kernel/isle/accel/ interface for more detailed information and control.

46.2.6 Display Stack: Wayland and Buffer Sharing

The modern Linux display stack centers on Wayland compositors consuming DRM/KMS (Section 46.2.1). ISLE provides the full kernel infrastructure these compositors require, with crash-recovery advantages that Linux cannot offer.

DMA-BUF (Cross-Device Buffer Sharing)

DMA-BUF is the kernel primitive for sharing memory buffers between devices — GPU to display controller, GPU to video encoder, camera to GPU, GPU to network (RDMA). In Linux, DMA-BUF is a global namespace with weak access control. In ISLE, each DMA-BUF is a capability-protected kernel object:

DMA-BUF lifecycle:
  1. Exporter (e.g., GPU driver) creates a DMA-BUF from device memory
  2. Returns a capability token (not a raw file descriptor)
  3. Importer (e.g., display driver) receives the capability via IPC
  4. Importer maps the DMA-BUF into its device's address space
  5. Zero-copy path: both devices access the same physical memory

For Linux compatibility, DMA-BUF capabilities are presented as file descriptors through the compat layer, so existing userspace (Mesa, Wayland compositors, GStreamer) works unmodified.

Explicit Synchronization (dma_fence / sync_file)

GPU and display operations are asynchronous. Explicit sync primitives coordinate them:

  • dma_fence: kernel-internal synchronization point representing GPU work completion. Created by the GPU driver when work is submitted, signaled when the GPU finishes.
  • sync_file: userspace-visible wrapper around one or more dma_fences. Exported as a file descriptor. Wayland compositors use these to know when a client's rendering is complete before presenting.
  • Timeline semaphores: Vulkan-style monotonic sync objects. More efficient than binary fences for pipelined workloads (render frame N while displaying frame N-1).

ISLE implements the Linux-compatible sync_file ioctl interface (SYNC_IOC_MERGE, SYNC_IOC_FILE_INFO) so existing Wayland compositors and Vulkan drivers work without modification.

GBM (Generic Buffer Manager)

GBM is the userspace buffer allocation library that Wayland compositors use to allocate scanout-capable buffers. It talks to DRM via ioctl. ISLE's DRM compatibility layer (Section 46.2.1) ensures GBM works unmodified — gbm_create_device(), gbm_bo_create(), and related functions issue standard DRM ioctls that ISLE handles.

Render Nodes (/dev/dri/renderD*)

Render nodes provide unprivileged GPU compute and render access without modesetting capability. Any user can open a render node to submit GPU work (3D rendering, compute shaders, video encode/decode) without needing root or DRM master status. ISLE exposes render nodes via the DRM compat layer, mapping each to an isle-accel context creation with appropriate capability restrictions (no modesetting, no display control).

DRM Leases

DRM leases allow a client to take exclusive control of a display connector — the primary consumer is VR headsets (SteamVR uses DRM leases to drive the HMD display independently from the desktop compositor). ISLE supports the lease ioctl family (DRM_IOCTL_MODE_CREATE_LEASE, DRM_IOCTL_MODE_LIST_LESSEES, etc.) through the DRM compatibility layer.

KMS Atomic Modesetting

The modern display pipeline API. Replaces legacy drmModeSetCrtc with atomic commits that update multiple display properties (resolution, refresh, gamma, position) as a single transaction: either the entire commit succeeds or nothing changes. ISLE's DRM layer implements the full atomic commit interface (DRM_IOCTL_MODE_ATOMIC), including:
  • TEST_ONLY mode (compositor can test configurations without applying them)
  • Non-blocking commits (compositor doesn't stall waiting for vblank)
  • Property-based interface (all display state expressed as properties)

Multi-GPU Display (PRIME)

Render on GPU A, scanout on GPU B via DMA-BUF sharing (known as PRIME in the DRM ecosystem). The P2P DMA infrastructure (Section 43) handles the underlying data transfer. For the display case specifically:
  • DMA-BUF export from the render GPU → DMA-BUF import on the display GPU
  • If the GPUs share a PCIe switch, the transfer is direct P2P (no CPU copy)
  • If the GPUs are on different NUMA nodes, the transfer goes through host memory

HDR and Wide Color Gamut

Modern displays support HDR10, Dolby Vision, and wide color gamuts (DCI-P3, Rec. 2020). Kernel support is exposed via KMS atomic properties:
  • HDR_OUTPUT_METADATA: HDR static/dynamic metadata per connector
  • COLORSPACE: signals the color space to the display (BT.709, DCI-P3, BT.2020)
  • MAX_BPC: maximum bits-per-channel for the connector (8, 10, 12, 16)
  • COLOR_ENCODING / COLOR_RANGE: YCbCr encoding and quantization range

These are standard KMS properties exposed through atomic modesetting. Wayland compositors (wlroots, Mutter, KWin) use these to negotiate HDR output with the display.

Crash Recovery Advantage

If the DRM/GPU driver crashes (Tier 1 reload, Section 9):

  1. The driver binary is reloaded in ~50-150ms
  2. DMA-BUF metadata (size, format, sharing graph, capability tokens) survives the reload because it is managed by isle-core's memory subsystem, not the driver; buffer handles remain valid. VRAM contents do not survive: a device reset physically reinitializes GPU memory, destroying all framebuffer data, texture contents, and compute buffers, so applications must re-upload their data after recovery.
  3. Sync objects are re-created in signaled state (conservative — forces resubmission of any pending work, but doesn't deadlock the compositor)
  4. The Wayland compositor sees a brief stall (~100-500ms, the full recovery window) and must re-render its buffers (VRAM contents are lost), but does not lose its buffer handles or capability tokens, does not crash, and does not need to re-negotiate with clients. The compositor's CPU-side state (window tree, client list, layout) is fully preserved.
  5. Running applications still hold valid DMA-BUF capabilities. They receive an ENODATA error on the next access to a VRAM-backed buffer, indicating that the buffer contents are stale and must be re-uploaded. Applications that maintain CPU-side copies of their rendering data (the common case for double-buffered Wayland clients) can re-upload and resume rendering immediately. Applications that relied solely on VRAM-resident data with no CPU-side copy must regenerate or reload from disk.

In Linux, a GPU driver crash typically kills the entire X/Wayland session, requiring all graphical applications to restart. ISLE's display stack crash recovery preserves the sharing graph and buffer metadata, allowing rapid re-rendering rather than full session teardown. This is a direct consequence of the capability-managed DMA-BUF design and Tier 1 driver reload.