Sections 42–46 of the ISLE Architecture. For the full table of contents, see README.md.
Part XI: AI/ML and Accelerators
Unified accelerator framework, heterogeneous memory, isolation, in-kernel inference, and GPU compatibility.
42. Unified Accelerator Framework
This section defines kernel-level AI/ML infrastructure: unified accelerator management, heterogeneous memory, peer-to-peer DMA, multi-tenant isolation, in-kernel inference, and distributed ML networking. These are kernel-internal capabilities — training frameworks and model serving remain in userspace.
Design Constraints:
- Drop-in compatibility: Existing Linux GPU/accelerator userspace (CUDA runtime, ROCm, OpenCL, Vulkan compute, /dev/dri, /dev/nvidia*) must work through the compatibility layer. Existing tools must not break.
- Superset, not replacement: ISLE exposes additional accelerator management capabilities beyond what Linux provides. Userspace software that wants better scheduling, isolation, or memory management can opt in. Software that doesn't care sees standard Linux behavior.
- IP-clean: Built from public specifications (PCIe, VFIO, DRM/KMS interfaces), academic algorithms, and open standards. No proprietary API reimplementation.
42.1 Motivation
42.1.1 The Current State
In 2026, accelerators (GPUs, NPUs, TPUs, FPGAs, custom ASICs) are as fundamental to computing as network interfaces. Every cloud instance, every phone, every laptop has at least one. AI/ML workloads are the dominant growth driver in datacenter compute.
Yet the Linux kernel treats accelerators as dumb peripherals. The kernel's relationship to a GPU is roughly:
Linux kernel:
- Maps PCI BARs (MMIO)
- Sets up IOMMU
- Routes interrupts
- That's it. The driver owns everything else.
GPU driver (e.g., nvidia.ko):
- Owns the command queue
- Owns memory management (VRAM allocation, page tables)
- Owns scheduling (which process runs on which compute unit)
- Owns isolation (who can see whose memory)
- Owns power management (GPU clock, thermal)
- The kernel knows nothing about any of this.
This is equivalent to how operating systems managed CPUs before preemptive multitasking — the hardware vendor's runtime owns the resource, the kernel is blind.
42.1.2 What This Causes
| Problem | Impact |
|---|---|
| No kernel-visible GPU scheduling | Two containers sharing a GPU have no fairness guarantees. One can starve the other. |
| No cgroup integration | cpu.max limits CPU time. Nothing limits GPU time. Kubernetes has no way to enforce GPU QoS. |
| No memory accounting | GPU VRAM usage is invisible to the OOM killer. A GPU memory leak crashes the GPU driver, not just the leaking process. |
| No preemption | A long-running GPU kernel (training batch) blocks interactive inference. Linux has solved this for CPUs since 1991. |
| No unified memory management | CUDA UVM, AMD SVM, Intel SVM — each vendor's "unified memory" is driver-internal. The kernel's page fault handler doesn't know about GPU page tables. |
| No isolation | GPU memory between processes is isolated by the driver, not the kernel. Driver bugs = security holes. |
| Driver crash = system crash | NVIDIA driver hang requires full system reboot. No crash recovery. |
| No P2P DMA management | GPUDirect is vendor-specific. No kernel facility for "DMA from NVMe directly to GPU VRAM." |
42.1.3 The Opportunity
A kernel designed from scratch in 2026 can treat accelerators as first-class scheduled resources, the same way it treats CPUs, memory, and I/O bandwidth. This is not about running PyTorch in the kernel — it is about the kernel managing accelerator hardware the way it manages all other hardware: scheduling, memory, isolation, recovery.
ISLE's existing architecture is uniquely suited for this:
- KABI: Stable driver ABI means GPU driver updates are decoupled from kernel updates
- Crash recovery: GPU driver crashes can be survived (driver binary reload in ~50-150ms; total recovery including GPU hardware reset: ~100-500ms)
- Capability system: Natural fit for fine-grained accelerator access control
- Device registry: Models accelerator topology (GPU → engines, VRAM, display, encode)
- cgroup integration: CPU bandwidth guarantees (Section 15) extend naturally to accelerator time
- Zero-copy I/O paths: Generalize to device-to-device DMA
42.2 Unified Accelerator Framework
42.2.1 Design: isle-accel
A new KABI interface family for accelerator devices. Every accelerator driver — GPU, NPU, TPU, FPGA, DSP, custom ASIC — implements the same base interface. Hardware-specific capabilities are exposed through versioned extension vtables.
            isle-accel (KABI interface)
                      |
    +-----------------+-----------------+
    |                 |                 |
AccelBase        AccelCompute      AccelDisplay
(all             (compute-         (display/render-
 accelerators)    capable)          capable)
    |                 |                 |
 nvidia GPU       nvidia GPU        nvidia GPU
 amdgpu GPU       amdgpu GPU        amdgpu GPU
 xe GPU
 intel_npu
 custom_tpu
42.2.2 AccelBaseVTable — The Universal Interface
Every accelerator driver provides this vtable. It covers what the kernel needs to manage the device, not what userspace needs to program it.
/// Base vtable that every accelerator driver must implement.
/// This is the kernel-management interface, not the compute API.
#[repr(C)]
pub struct AccelBaseVTable {
/// Total size of this vtable struct in bytes. Allows the kernel to detect
/// when a driver provides a newer (larger) vtable than expected and safely
/// ignore trailing fields it does not understand.
pub vtable_size: u64,
/// KABI version encoded as (major << 16) | minor.
/// - **Major version** (upper 16 bits): incremented for breaking changes
/// (removed fields, changed semantics, reordered vtable entries).
/// Kernel and driver must agree on the major version or registration fails.
/// - **Minor version** (lower 16 bits): incremented for backward-compatible
/// additions (new `Option<fn>` fields appended to the vtable, new enum
/// variants). A driver with minor version N works with a kernel expecting
/// minor <= N.
///
/// **Negotiation protocol during driver registration:**
/// 1. Driver calls `register_driver()` with its vtable (including `version`
/// and `vtable_size`).
/// 2. Kernel extracts the major version: `driver_major = version >> 16`.
/// 3. If `driver_major != kernel_major`, registration fails with
/// `KABI_VERSION_MISMATCH`. The driver must be rebuilt against a
/// compatible kernel header.
/// 4. If `driver_major == kernel_major`, the kernel accepts the vtable.
/// Any `Option<fn>` fields beyond the driver's `vtable_size` are treated
/// as `None` (not present). Any required (non-Option) fields beyond the
/// driver's `vtable_size` cause registration failure.
/// 5. The negotiated version is recorded in the device registry node for
/// diagnostic purposes.
///
/// This protocol applies identically to AccelComputeVTable,
/// AccelDisplayVTable, RdmaDeviceVTable, and all other KABI vtables
/// in this section.
pub version: u32,
// === Device Info ===
/// Return device capabilities and resource inventory.
pub get_info: unsafe extern "C" fn(
ctx: *mut c_void,
out_info: *mut AccelDeviceInfo,
) -> IoResultCode,
// === Context Management ===
/// Create an execution context (one per process/tenant).
/// Returns an opaque context handle.
pub create_context: unsafe extern "C" fn(
ctx: *mut c_void,
owner_pid: u32,
priority: AccelPriority,
limits: *const AccelContextLimits,
out_context: *mut AccelContextHandle,
) -> IoResultCode,
/// Destroy an execution context and free all its resources.
pub destroy_context: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode,
// === Command Submission ===
/// Submit a command buffer for execution.
/// The kernel calls this after validating capabilities and
/// scheduling the submission according to policy.
pub submit_commands: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
cmd_buffer: *const u8,
cmd_len: u32,
fences: *const AccelFence,
fence_count: u32,
out_submission: *mut AccelSubmissionHandle,
) -> IoResultCode,
/// Poll for completion of a submitted command buffer.
pub poll_completion: unsafe extern "C" fn(
ctx: *mut c_void,
submission: AccelSubmissionHandle,
out_status: *mut AccelCompletionStatus,
) -> IoResultCode,
/// Request preemption or cooperative yield of a running context.
///
/// Behavior depends on `AccelDeviceInfo::preemption_granularity`:
/// - `Instruction` or `DrawDispatch`: True preemption. The device saves
/// context state and stops the running workload within ~50μs-10ms
/// (depends on preemption granularity and workload; instruction-level
/// preemption on modern compute GPUs is typically 50-100μs, while
/// draw/dispatch-level preemption can take milliseconds).
/// The context can be resumed later via a new `submit_commands`.
/// - `CommandBuffer`: The device finishes the current command buffer
/// but does not start the next queued one. Latency is bounded by
/// the longest in-flight command buffer.
/// - `None`: Cooperative yield only. The driver stops submitting new
/// command buffers after the current one completes. Cannot interrupt
/// a running dispatch.
///
/// Returns `IO_OK` if the preemption/yield was initiated (completion
/// is asynchronous — the scheduler polls for the context to become
/// idle). Returns `IO_NOT_SUPPORTED` if the driver does not implement
/// any form of preemption or yield.
pub preempt_context: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
reason: PreemptReason,
) -> IoResultCode>,
// === Memory Management ===
/// Allocate device-local memory (VRAM, local SRAM, etc.).
pub alloc_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
size: u64,
alignment: u64,
page_size: AccelPageSize,
flags: AccelMemFlags,
out_handle: *mut AccelMemHandle,
) -> IoResultCode,
/// Free device-local memory.
pub free_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
) -> IoResultCode,
/// Map device memory into a CPU-visible virtual address.
pub map_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
offset: u64,
size: u64,
out_cpu_addr: *mut u64,
) -> IoResultCode,
/// Migrate pages between CPU RAM and device memory.
/// Direction determined by flags.
pub migrate_pages: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
pages: *const AccelMigrationEntry,
page_count: u32,
flags: MigrationFlags,
) -> IoResultCode>,
// === Utilization Reporting ===
/// Report current device utilization to the kernel scheduler.
pub get_utilization: unsafe extern "C" fn(
ctx: *mut c_void,
out_util: *mut AccelUtilization,
) -> IoResultCode,
// === Power/Thermal ===
/// Get current power and thermal state.
pub get_power_state: unsafe extern "C" fn(
ctx: *mut c_void,
out_state: *mut AccelPowerState,
) -> IoResultCode,
/// Set performance level / clock frequency.
pub set_performance_level: Option<unsafe extern "C" fn(
ctx: *mut c_void,
level: AccelPerfLevel,
) -> IoResultCode>,
// === Reset ===
/// Reset a single execution context (not the whole device).
pub reset_context: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode,
/// Full device reset.
pub reset_device: unsafe extern "C" fn(
ctx: *mut c_void,
) -> IoResultCode,
// === Completion Notification ===
/// Register a kernel callback for command completion notification.
/// Replaces polling for latency-sensitive contexts.
pub register_completion_callback: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
callback: unsafe extern "C" fn(
context: AccelContextHandle,
submission: AccelSubmissionHandle,
status: AccelCompletionStatus,
),
) -> IoResultCode>,
}
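The registration-time check described in steps 2–4 of the `version` doc comment reduces to a few comparisons. A minimal sketch — the `Negotiation` type and the `last_required_field_end` parameter are illustrative names, not part of the KABI:

```rust
/// Outcome of KABI version negotiation. Illustrative names only.
#[derive(Debug, PartialEq)]
pub enum Negotiation {
    /// Accepted; optional fields beyond the driver's vtable_size read as None.
    Accepted { negotiated_minor: u16 },
    /// Major versions differ: the driver must be rebuilt.
    VersionMismatch { kernel_major: u16, driver_major: u16 },
    /// A required (non-Option) field lies beyond the driver's vtable_size.
    MissingRequiredField,
}

pub fn negotiate(
    kernel_version: u32,          // kernel's (major << 16) | minor
    driver_version: u32,          // driver's (major << 16) | minor
    driver_vtable_size: u64,      // driver-reported vtable size in bytes
    last_required_field_end: u64, // offset just past the last required field
) -> Negotiation {
    let kernel_major = (kernel_version >> 16) as u16;
    let driver_major = (driver_version >> 16) as u16;
    // Step 3: major mismatch is a hard failure.
    if driver_major != kernel_major {
        return Negotiation::VersionMismatch { kernel_major, driver_major };
    }
    // Step 4: every required field must fit inside the driver's vtable.
    if driver_vtable_size < last_required_field_end {
        return Negotiation::MissingRequiredField;
    }
    Negotiation::Accepted { negotiated_minor: (driver_version & 0xFFFF) as u16 }
}
```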
Callback restrictions: Completion callbacks execute in hardirq (interrupt)
context. They MUST NOT:
- Allocate memory (no Box, Vec, or slab allocation)
- Acquire sleeping locks (no Mutex, only SpinLock with try_lock)
- Perform I/O (no disk, network, or MMIO beyond the accelerator's own registers)
- Call schedule() or any function that may sleep
Callbacks that need to perform complex work (e.g., chaining dependent dispatches,
updating shared data structures) must defer to a per-accelerator workqueue thread
via accel_defer(closure). The workqueue runs in process context with full kernel
capabilities. The completion callback's job is limited to: (1) recording completion
status, (2) waking the workqueue if deferred work is pending, and (3) updating
per-context statistics (fence value, timing).
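The allowed IRQ-context work — record status, update the fence value, flag deferred work — can be sketched as below. `CompletionSlot` and the boolean wake decision are illustrative stand-ins for the per-context state and the `accel_defer` wakeup; this is a userspace-testable approximation, not kernel code:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicU64, Ordering};

/// Per-submission completion record. Everything the IRQ-context callback
/// touches is atomic: no allocation, no locks, no I/O.
pub struct CompletionSlot {
    pub status: AtomicU32,        // AccelCompletionStatus as u32
    pub fence_value: AtomicU64,   // timeline fence value at completion
    pub work_pending: AtomicBool, // set if the workqueue must run
}

/// The whole job of the interrupt-context callback: record completion
/// status, update the fence, and flag deferred work. Returns true if the
/// IRQ path should wake the per-accelerator workqueue thread.
pub fn on_completion_irq(
    slot: &CompletionSlot,
    status: u32,
    fence: u64,
    has_deferred_work: bool,
) -> bool {
    slot.status.store(status, Ordering::Release);
    slot.fence_value.store(fence, Ordering::Release);
    if has_deferred_work {
        slot.work_pending.store(true, Ordering::Release);
        true
    } else {
        false
    }
}
```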
42.2.3 Key Data Types
/// Device capability and resource inventory.
#[repr(C)]
pub struct AccelDeviceInfo {
/// Device class.
pub device_class: AccelDeviceClass,
/// Number of hardware compute units in this device.
/// The definition of "compute unit" is device-class-specific:
/// - GPU: Streaming Multiprocessors (SMs) for NVIDIA, Compute Units (CUs) for AMD
/// - NPU: neural processing cores
/// - FPGA: configurable logic blocks
/// - DSP: DSP cores
/// This is the count reported by the driver via `get_info()` and used by the
/// AccelScheduler for proportional resource accounting. For MIG/partitioned
/// devices, this is the compute units assigned to the partition, not the
/// full device total.
pub compute_units: u32,
/// Device-local memory size (bytes). 0 if no local memory.
pub local_memory_bytes: u64,
/// Maximum concurrent execution contexts.
pub max_contexts: u32,
/// Hardware preemption support.
pub preemption_granularity: PreemptionGranularity,
/// Maximum command buffer size (bytes).
pub max_cmd_buffer_size: u32,
/// Supported memory types.
pub memory_types: AccelMemTypeFlags,
/// PCIe atomics support (needed for fine-grained SVM).
/// Uses `u8` (0=false, 1=true) instead of `bool` for stable KABI:
/// C `bool` has implementation-defined size and alignment, making it
/// unsuitable for cross-compilation-unit ABI boundaries.
pub pcie_atomics: u8,
/// Peer-to-peer DMA capability.
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub p2p_capable: u8,
/// Unified virtual addressing (CPU and device share address space).
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub unified_addressing: u8,
/// Hardware page fault support (device can fault on unmapped pages).
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub hw_page_faults: u8,
pub _pad: [u8; 32],
}
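The u8-instead-of-bool rationale above can be enforced mechanically with compile-time layout pins. A sketch using an illustrative stand-in struct (not the real `AccelDeviceInfo`):

```rust
/// Stand-in mirror of the four capability fields; `u8` keeps the layout
/// fixed across compilers. Illustrative only.
#[repr(C)]
pub struct AccelCapsByte {
    pub pcie_atomics: u8,
    pub p2p_capable: u8,
    pub unified_addressing: u8,
    pub hw_page_faults: u8,
}

// Compile-time layout pins: if a field's type or the struct's packing
// ever changes, the build fails instead of silently breaking the KABI.
const _: () = assert!(core::mem::size_of::<AccelCapsByte>() == 4);
const _: () = assert!(core::mem::align_of::<AccelCapsByte>() == 1);
```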
#[repr(u32)]
pub enum AccelDeviceClass {
/// General-purpose GPU (compute + graphics + display).
Gpu = 0,
/// Compute-only GPU (no display, e.g., datacenter SKUs).
GpuCompute = 1,
/// Neural Processing Unit (fixed-function inference).
Npu = 2,
/// Tensor Processing Unit / AI ASIC.
Tpu = 3,
/// FPGA with compute overlay.
Fpga = 4,
/// Digital Signal Processor.
Dsp = 5,
/// Media processor (video encode/decode).
MediaProcessor = 6,
/// Computational storage device with embedded compute capability.
ComputeStorage = 7,
/// Other / vendor-specific.
Other = 255,
}
#[repr(u32)]
pub enum PreemptionGranularity {
/// No preemption support. Context runs to completion.
None = 0,
/// Preempt at command buffer boundaries.
CommandBuffer = 1,
/// Preempt at draw call / dispatch boundaries.
DrawDispatch = 2,
/// Preempt at instruction level (mid-shader/kernel).
Instruction = 3,
}
/// Reason for requesting preemption of a running accelerator context.
/// Passed to `preempt_context` so the driver can log/report the cause
/// and, on devices that support it, choose an appropriate preemption
/// strategy (e.g., urgent drain vs. graceful yield).
#[repr(u32)]
pub enum PreemptReason {
/// A higher-priority submission is waiting behind this context.
/// The scheduler detected priority inversion and is preempting the
/// lower-priority context to unblock the higher-priority one.
PriorityInversion = 0,
/// The context has exceeded its CBS bandwidth budget for the current
/// scheduling period. Fairness enforcement requires yielding the
/// device so other contexts can use their guaranteed bandwidth.
FairnessTimeout = 1,
/// The context's current submission has exceeded its
/// `AccelContextLimits::max_execution_us` timeout. The scheduler is
/// preempting to enforce the per-submission execution time limit.
ExecutionTimeout = 2,
/// Administrative eviction: the context is being forcibly removed
/// from the device. Causes include driver unload, device reset,
/// cgroup removal, or process exit cleanup.
AdminEvict = 3,
}
/// Per-context resource limits (enforced by kernel).
#[repr(C)]
pub struct AccelContextLimits {
/// Maximum device memory this context can allocate (bytes).
/// 0 = no limit (subject to device capacity).
pub max_memory_bytes: u64,
/// Maximum compute time per submission (microseconds).
/// 0 = no limit. Submissions exceeding this are preempted.
pub max_execution_us: u64,
/// Bandwidth guarantee: guaranteed microseconds of compute
/// per scheduling period. See Section 44 (cgroup integration).
/// 0 = best-effort (no guarantee).
pub guaranteed_bandwidth_us: u64,
/// Scheduling period for bandwidth accounting (microseconds).
/// Default: 1_000_000 (1 second).
pub bandwidth_period_us: u64,
pub _pad: [u8; 32],
}
/// Scheduling priority for accelerator contexts.
#[repr(u32)]
pub enum AccelPriority {
/// Background / batch. Lowest priority. Preempted by all others.
Background = 0,
/// Normal interactive workload.
Normal = 1,
/// High priority (e.g., real-time inference with SLO).
High = 2,
/// Realtime. Highest priority. Preempts all others immediately.
Realtime = 3,
}
/// Entry describing a page to migrate between CPU and device memory.
/// Used by migrate_pages() in AccelBaseVTable.
#[repr(C)]
pub struct AccelMigrationEntry {
/// Virtual address of the page to migrate (must be page-aligned).
pub vaddr: u64,
/// Current location: 0 = CPU RAM, 1 = device memory.
pub current_location: u8,
/// Desired location after migration.
pub target_location: u8,
/// Result of migration (filled by driver): 0 = success, errno on failure.
pub result: i16,
pub _pad: [u8; 4],
}
bitflags! {
/// Flags controlling page migration behavior.
#[repr(C)]
pub struct MigrationFlags: u32 {
/// Migrate from CPU RAM to device memory.
const TO_DEVICE = 1 << 0;
/// Migrate from device memory to CPU RAM.
const TO_CPU = 1 << 1;
/// Force migration even if page is hot/pinned.
const FORCE = 1 << 2;
/// Asynchronous migration (return immediately, complete via callback).
const ASYNC = 1 << 3;
/// Prefetch hint (migrate pages likely to be accessed soon).
const PREFETCH = 1 << 4;
}
}
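`migrate_pages` takes its direction from these flags, which suggests the kernel should validate combinations before calling the driver. A hedged sketch using plain bit constants mirroring the definitions above — the spec does not say whether TO_DEVICE and TO_CPU may be combined, so this sketch assumes they are mutually exclusive per call:

```rust
// Bit values mirror the MigrationFlags definition above.
pub const TO_DEVICE: u32 = 1 << 0;
pub const TO_CPU: u32 = 1 << 1;
pub const FORCE: u32 = 1 << 2;
pub const ASYNC: u32 = 1 << 3;
pub const PREFETCH: u32 = 1 << 4;

/// Reject flag combinations migrate_pages() cannot honor. Exactly one
/// direction must be set; unknown bits are rejected so future flags
/// cannot be silently ignored by an older kernel.
pub fn validate_migration_flags(flags: u32) -> Result<(), &'static str> {
    const KNOWN: u32 = TO_DEVICE | TO_CPU | FORCE | ASYNC | PREFETCH;
    if flags & !KNOWN != 0 {
        return Err("unknown flag bits");
    }
    match (flags & TO_DEVICE != 0, flags & TO_CPU != 0) {
        (true, false) | (false, true) => Ok(()),
        (true, true) => Err("TO_DEVICE and TO_CPU are mutually exclusive"),
        (false, false) => Err("no direction specified"),
    }
}
```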
/// Device utilization report.
#[repr(C)]
pub struct AccelUtilization {
/// Compute utilization 0-100%.
pub compute_percent: u32,
/// Memory bandwidth utilization 0-100%.
pub memory_bw_percent: u32,
/// Memory usage (bytes allocated).
pub memory_used_bytes: u64,
/// Memory total (bytes).
pub memory_total_bytes: u64,
/// Number of active contexts.
pub active_contexts: u32,
/// Current temperature (millidegrees Celsius). Signed: sub-zero
/// readings are possible (cold aisles, edge deployments).
pub temperature_mc: i32,
/// Current power draw (milliwatts).
pub power_mw: u32,
/// Current clock frequency (MHz).
pub clock_mhz: u32,
}
// Opaque handle types (all u64 newtypes, #[repr(transparent)] for
// zero-cost FFI — the newtype has the same ABI as the inner u64).
#[repr(transparent)]
pub struct AccelContextHandle(pub u64);
#[repr(transparent)]
pub struct AccelMemHandle(pub u64);
#[repr(transparent)]
pub struct AccelSubmissionHandle(pub u64);
#[repr(transparent)]
pub struct P2pMappingHandle(pub u64);
/// Fence for synchronizing command submissions.
// The (device_id, context_id, value) tuple uniquely identifies a fence point.
// Cross-device fence signaling requires both devices to be registered in the
// same AccelFenceRegistry (Section 42.2.4).
#[repr(C)]
pub struct AccelFence {
pub fence_type: AccelFenceType,
pub value: u64,
pub device_id: DeviceNodeId, // Owning device (prevents cross-device forgery)
pub context_id: u32, // Owning context within the device
}
#[repr(u32)]
pub enum AccelFenceType {
/// Device-local fence (GPU timeline semaphore).
DeviceLocal = 0,
/// Cross-device fence (for P2P synchronization).
CrossDevice = 1,
/// CPU-signalable fence (host-side event).
CpuSignal = 2,
}
#[repr(u32)]
pub enum AccelCompletionStatus {
/// Command completed successfully.
Success = 0,
/// Command failed with device error.
Error = 1,
/// Command timed out (exceeded max_execution_us).
Timeout = 2,
/// Command was preempted by higher-priority context.
Preempted = 3,
/// Partial error: some work completed, some failed.
PartialError = 4,
}
/// Global registry for cross-device fence synchronization.
/// Devices must be registered in the same AccelFenceRegistry to share
/// cross-device fences. The registry is kernel-internal; userspace
/// interacts through /dev/isle-accel-N ioctls.
///
/// There is one global AccelFenceRegistry per accelerator topology domain
/// (typically one per NUMA node or PCIe domain). Devices in different
/// domains cannot share cross-device fences.
pub struct AccelFenceRegistry {
/// Registered devices (device_id -> fence_protocol_support).
devices: spin::RwLock<BTreeMap<DeviceNodeId, FenceProtocolSupport>>,
/// Active cross-device fences, keyed by (device_id, context_id, value).
/// The fence's signaling status is atomically updated by the device
/// that owns the fence. Devices polling on a foreign fence read from
/// this table.
fences: spin::RwLock<BTreeMap<(DeviceNodeId, u32, u64), AtomicBool>>,
/// Waiters for fence signaling (device_id -> list of waiters).
/// When a fence signals, all registered waiters are notified via
/// their registered callback.
waiters: spin::RwLock<BTreeMap<DeviceNodeId, Vec<FenceWaiterEntry>>>,
}
/// Fence protocol capabilities for a registered device.
#[repr(C)]
pub struct FenceProtocolSupport {
/// Device supports timeline semaphores (monotonically increasing value).
/// `u8` (0=false, 1=true) for stable KABI — see the
/// `AccelDeviceInfo::pcie_atomics` comment.
pub timeline_semaphores: u8,
/// Device supports cross-device signaling via hardware sync objects.
pub hw_cross_device: u8,
/// Device supports CPU-signaling of device fences (for host wait).
pub cpu_signal: u8,
pub _pad: [u8; 5],
/// Maximum fence value before wrap (0 = no limit).
pub max_fence_value: u64,
}
/// Entry in the fence waiter table.
struct FenceWaiterEntry {
/// Fence being waited on.
fence: AccelFence,
/// Device waiting for the fence.
waiter_device: DeviceNodeId,
/// Callback to invoke when fence signals (driver-provided).
callback: unsafe extern "C" fn(device_id: DeviceNodeId, fence: AccelFence),
}
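The registry's fence table can be illustrated with a simplified userspace sketch: std `RwLock`/`BTreeMap` stand in for the kernel's `spin::RwLock` map, keys are the (device_id, context_id, value) tuple from `AccelFence`, and waiter callbacks are omitted:

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

type DeviceNodeId = u32; // stand-in for the kernel's DeviceNodeId
type FenceKey = (DeviceNodeId, u32, u64); // (device, context, value)

/// Simplified fence table: signaled by the owning device, polled by anyone.
pub struct FenceTable {
    fences: RwLock<BTreeMap<FenceKey, AtomicBool>>,
}

impl FenceTable {
    pub fn new() -> Self {
        Self { fences: RwLock::new(BTreeMap::new()) }
    }

    /// Owning device creates a fence point (initially unsignaled).
    pub fn create(&self, key: FenceKey) {
        self.fences.write().unwrap().insert(key, AtomicBool::new(false));
    }

    /// Owning device signals the fence atomically; returns false for an
    /// unknown fence (stale handle or cross-device forgery attempt).
    pub fn signal(&self, key: FenceKey) -> bool {
        match self.fences.read().unwrap().get(&key) {
            Some(flag) => {
                flag.store(true, Ordering::Release);
                true
            }
            None => false,
        }
    }

    /// A foreign device polling on the fence reads the atomic flag.
    pub fn poll(&self, key: FenceKey) -> Option<bool> {
        self.fences.read().unwrap().get(&key).map(|f| f.load(Ordering::Acquire))
    }
}
```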
42.2.4 Kernel-Side Accelerator Scheduler
The kernel maintains an accelerator scheduler that sits between userspace submissions
and the driver's submit_commands. This is the core innovation — the kernel sees
and controls accelerator time.
// isle-core/src/accel/scheduler.rs (kernel-internal)
/// Per-accelerator scheduler.
pub struct AccelScheduler {
/// Device this scheduler manages.
device_id: DeviceNodeId,
/// Device capabilities (cached from get_info).
device_info: AccelDeviceInfo,
/// Active contexts, ordered by priority and deadline.
/// Uses a fixed-capacity sorted array (no heap allocation).
/// Capacity is set to `device_info.max_contexts` at initialization.
contexts: FixedSortedArray<AccelContextHandle, AccelContextState>,
/// Per-context CBS bandwidth servers (same algorithm as CPU scheduler,
/// see Section 15). Pre-allocated at initialization with capacity
/// equal to `device_info.max_contexts`.
bandwidth_servers: FixedVec<AccelCbsServer>,
/// Hardware command queue depth (how many submissions are in-flight).
hw_queue_depth: AtomicU32,
/// Maximum concurrent in-flight submissions.
max_inflight: u32,
/// Scheduling policy.
policy: AccelSchedPolicy,
}
pub struct AccelContextState {
/// Owner process/cgroup.
owner_pid: u32,
cgroup_id: u64,
/// Priority class.
priority: AccelPriority,
/// Resource limits.
limits: AccelContextLimits,
/// Index into `AccelScheduler::bandwidth_servers` (if guaranteed
/// bandwidth is set). The CBS server state is owned by the scheduler's
/// `bandwidth_servers` array — storing it here as well would create
/// a consistency hazard (two copies of the same server state). The
/// scheduler uses this index to look up the context's CBS server
/// during scheduling decisions.
cbs_server_index: Option<u32>,
/// Pending command buffers waiting to be submitted to hardware.
/// Fixed-capacity ring buffer, pre-allocated at context creation.
pending_queue: FixedRingBuffer<PendingSubmission>,
/// In-flight submissions (on hardware).
/// Fixed-capacity array, sized to `max_inflight` at initialization.
inflight: FixedVec<AccelSubmissionHandle>,
/// Accounting: total compute time consumed (nanoseconds).
total_compute_ns: AtomicU64,
/// Accounting: total memory allocated (bytes).
total_memory_bytes: AtomicU64,
}
/// Pending submission waiting to be dispatched to hardware.
/// Stored in the context's `pending_queue` until the scheduler submits it.
#[repr(C)]
pub struct PendingSubmission {
/// Handle to the command buffer (driver-allocated).
pub cmd_buffer: AccelCmdBufferHandle,
/// Submission ID assigned by the driver for tracking.
pub driver_submit_id: u64,
/// Timestamp when submission was queued (for timeout detection).
pub queued_timestamp_ns: u64,
/// Number of fences this submission depends on (0 = immediate submit).
pub dependency_count: u32,
/// Semaphore to signal on completion (optional).
pub completion_semaphore: Option<AccelSemaphoreHandle>,
}
/// Tracking record for an in-flight submission on hardware. Kept
/// alongside each entry in the context's `inflight` array until
/// completion. Named `InflightSubmission` to avoid colliding with the
/// opaque `AccelSubmissionHandle` newtype defined in 42.2.3.
#[repr(C)]
pub struct InflightSubmission {
/// Driver-assigned submission ID (matches PendingSubmission::driver_submit_id).
pub driver_submit_id: u64,
/// Timestamp when submitted to hardware (for timeout detection).
pub submitted_timestamp_ns: u64,
/// Fence that will be signaled on completion.
pub completion_fence: AccelFenceHandle,
}
#[repr(u32)]
pub enum AccelSchedPolicy {
/// Simple round-robin between contexts. Default for NPU/FPGA.
RoundRobin = 0,
/// Priority-based with preemption. Default for GPU.
Priority = 1,
/// CBS bandwidth guarantee + priority. Default when cgroup limits set.
Guaranteed = 2,
}
/// CBS (Constant Bandwidth Server) state for per-context bandwidth enforcement.
/// Uses the same algorithm as the CPU scheduler (Section 15), adapted for
/// accelerator time. Each context with a bandwidth guarantee gets its own
/// AccelCbsServer instance in the scheduler's bandwidth_servers array.
#[repr(C)]
pub struct AccelCbsServer {
/// Context this server is associated with.
pub context_id: AccelContextHandle,
/// Maximum bandwidth (nanoseconds of accelerator time per period).
pub bandwidth_ns: u64,
/// Period in nanoseconds (e.g., 10ms = 10_000_000).
pub period_ns: u64,
/// Current runtime consumed in this period (nanoseconds).
pub runtime_consumed: AtomicU64,
/// Absolute deadline (nanoseconds since boot). Computed as
/// period_start + period_ns when the context is activated.
pub deadline: AtomicU64,
/// Start of the current period (nanoseconds since boot).
pub period_start: AtomicU64,
}
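The replenishment and budget accounting implied by these fields can be sketched without atomics — a plain-struct approximation of the in-kernel logic, with illustrative names:

```rust
/// Plain-struct stand-in for AccelCbsServer's accounting fields.
pub struct CbsState {
    pub bandwidth_ns: u64,     // budget per period
    pub period_ns: u64,        // scheduling period
    pub runtime_consumed: u64, // runtime charged this period
    pub period_start: u64,     // ns since boot
    pub deadline: u64,         // absolute deadline = period_start + period_ns
}

/// Charge `delta_ns` of accelerator time to the server. Returns true if
/// the context has exhausted its budget and must yield until replenishment.
pub fn cbs_charge(s: &mut CbsState, now_ns: u64, delta_ns: u64) -> bool {
    // Replenish at period boundaries: reset the consumed runtime and
    // push the deadline out by one period.
    if now_ns >= s.period_start + s.period_ns {
        s.period_start = now_ns;
        s.deadline = now_ns + s.period_ns;
        s.runtime_consumed = 0;
    }
    s.runtime_consumed += delta_ns;
    s.runtime_consumed > s.bandwidth_ns
}
```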
Allocation discipline: All scheduler data structures use pre-allocated, fixed-capacity
storage. No heap allocation occurs on the scheduling fast path. FixedSortedArray,
FixedVec, and FixedRingBuffer are kernel-internal types that allocate their backing
storage once at initialization (when the device is registered or a context is created)
and never resize. This ensures that submit_commands, poll_completion, and context
scheduling are deterministic-latency operations with no allocator contention.
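The contract of these fixed-capacity types — allocate backing storage once, never resize, report full instead of growing — can be illustrated with a minimal ring buffer sketch (not the kernel implementation):

```rust
/// Minimal fixed-capacity ring buffer: backing storage is allocated once
/// at construction and never resized, so push/pop are allocation-free.
pub struct FixedRing<T> {
    buf: Box<[Option<T>]>,
    head: usize, // next pop position
    tail: usize, // next push position
    len: usize,
}

impl<T> FixedRing<T> {
    /// The only allocation happens here, at initialization time.
    pub fn with_capacity(cap: usize) -> Self {
        let buf = (0..cap).map(|_| None).collect::<Vec<_>>().into_boxed_slice();
        Self { buf, head: 0, tail: 0, len: 0 }
    }

    /// Returns Err(item) when full: the caller (the scheduler) applies
    /// backpressure instead of growing the queue.
    pub fn push(&mut self, item: T) -> Result<(), T> {
        if self.len == self.buf.len() {
            return Err(item);
        }
        self.buf[self.tail] = Some(item);
        self.tail = (self.tail + 1) % self.buf.len();
        self.len += 1;
        Ok(())
    }

    pub fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let item = self.buf[self.head].take();
        self.head = (self.head + 1) % self.buf.len();
        self.len -= 1;
        item
    }
}
```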
Scheduling flow:
Userspace submits work (via /dev/isle-accel-N or DRM ioctl compat):
|
v
ISLE Core validates capabilities
- Does this process have an AccelContext for this device?
- Does this process's cgroup allow more accelerator time?
- Does this context have memory budget for the command buffer?
|
v
AccelScheduler queues the submission
- Assigns priority based on context priority + cgroup policy
- Checks CBS bandwidth server: is this context within its guarantee?
|
v
AccelScheduler picks next submission to dispatch to hardware
- Priority order: Realtime > High > Normal > Background
- Within same priority: CBS server with earliest deadline first
- If higher-priority work arrives and device supports preemption:
preempt current context via driver's preempt_context()
|
v
Driver's submit_commands() sends work to hardware
|
v
Hardware completion interrupt
|
v
Driver's poll_completion() reports result
|
v
AccelScheduler updates accounting (compute time, memory)
|
v
Userspace receives completion notification
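The dispatch ordering in the flow above — strict priority, then earliest CBS deadline within a class — can be sketched as:

```rust
/// A context with work pending, as seen by the dispatch loop.
/// Illustrative: the real scheduler walks its FixedSortedArray.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Candidate {
    pub context_id: u64,
    pub priority: u32,    // AccelPriority as u32; higher value wins
    pub deadline_ns: u64, // CBS absolute deadline; earlier wins within a class
}

/// Dispatch rule: Realtime > High > Normal > Background, then CBS
/// server with the earliest deadline first within the same class.
pub fn pick_next(ready: &[Candidate]) -> Option<Candidate> {
    ready.iter().copied().min_by(|a, b| {
        b.priority
            .cmp(&a.priority) // higher priority sorts first
            .then(a.deadline_ns.cmp(&b.deadline_ns)) // then earliest deadline
    })
}
```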
AccelScheduler vs Firmware Scheduling Boundary:
Modern GPUs have their own internal schedulers (e.g., NVIDIA's GPC/TPC scheduler, AMD's ACE/HWS). The AccelScheduler does NOT replace or duplicate firmware scheduling. The boundary is clear:
AccelScheduler (kernel, software):
Controls WHICH contexts get to submit work and WHEN.
Enforces fairness, cgroup limits, bandwidth guarantees, priorities.
Decides the ORDER of submissions to the hardware queue.
Granularity: per-submission (command buffer level).
Firmware scheduler (device, hardware/firmware):
Controls HOW submitted work is mapped to hardware execution units.
Distributes warps/wavefronts across SMs/CUs.
Manages context switching on the device.
Handles workgroup scheduling and occupancy.
Granularity: per-instruction-group (warp/wavefront level).
Interaction:
AccelScheduler submits command buffer → firmware takes over.
The kernel does not see or control internal GPU scheduling.
The kernel CAN preempt at context boundaries (via preempt_context()),
but CANNOT preempt mid-kernel on hardware that doesn't support it.
Preemption capability is reported by get_info().preemption_granularity.
Devices with PreemptionGranularity::None do not support preemption;
the scheduler uses cooperative yield for those devices.
When the device has hardware preemption (NVIDIA compute preemption,
AMD MES): AccelScheduler can request preemption, and the firmware
will drain the current workgroup and save context. Modern GPUs
(Ampere+, CDNA2+) complete preemption in ~50μs-10ms including
context save/restore, depending on preemption granularity and
in-flight workload (instruction-level compute preemption is
typically 50-100μs; draw/dispatch-level can be milliseconds).
This is expensive compared to CPU context switches (~1μs) but
bounded. The AccelScheduler therefore avoids preemption
on the fast path (submissions are ordered by priority before dispatch)
and triggers preemption only when necessary: priority inversion
(a Realtime context is blocked behind a Background dispatch),
fairness enforcement (a context has exceeded its CBS bandwidth
budget), or timeout (max_execution_us exceeded).
Older GPUs (pre-Ampere NVIDIA, pre-CDNA2 AMD) may report
PreemptionGranularity::CommandBuffer or ::None, where preemption
latency is bounded by the longest in-flight command buffer
(potentially hundreds of milliseconds for large compute dispatches).
The scheduler treats these as cooperative-yield devices (see below).
Non-preemptible GPU limitation and cooperative yield:
Many GPUs (older hardware, NPUs, FPGAs, and some mid-range consumer GPUs) report
PreemptionGranularity::None. On these devices, the AccelScheduler CANNOT interrupt
a running workload mid-dispatch. The scheduler can only enforce time slicing at
submission boundaries — between command buffer submissions — not within a single
dispatch. This is coarse-grained time slicing.
Consequences for non-preemptible devices:
- max_execution_us is enforced at submission granularity, not instruction granularity.
A single long-running dispatch that exceeds max_execution_us cannot be aborted; the
scheduler must wait for it to complete naturally before taking corrective action.
- Time guarantees for other contexts degrade proportionally to the longest single
dispatch in the queue. Workloads with large dispatches should set small
max_cmd_buffer_size limits (enforced by the kernel at submission time) to bound
worst-case dispatch duration.
Cooperative yield mechanism for non-preemptible devices:
- The driver implements preempt_context() as a cooperative yield: after the
current command buffer completes, do not submit the next one, and signal the scheduler.
This is NOT true preemption — the driver cannot interrupt the GPU mid-dispatch. It
stops feeding new work at the next natural command buffer boundary.
- The preempt_context vtable entry serves double duty: on preemptible devices it
triggers true hardware preemption (~50μs-10ms); on non-preemptible devices it
triggers cooperative yield (latency bounded by longest in-flight command buffer).
The scheduler checks preemption_granularity to know which behavior to expect.
- Drivers for non-preemptible devices MUST report PreemptionGranularity::None in
get_info(). Misreporting this field is a driver correctness violation.
- The AccelScheduler logs a warning when a non-preemptible context exceeds
max_execution_us, records the overrun duration in AccelContextState::total_compute_ns,
and applies backpressure (delayed next submission) to amortize the overrun across the
cgroup's next scheduling period.
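A minimal sketch of the `preempt_context` double duty, assuming simplified stand-ins for `PreemptionGranularity` and the scheduler's latency expectation (`PreemptBehavior` is an assumed name):

```rust
// Sketch of the preempt_context() double-duty described above. The enums
// are simplified stand-ins for the spec's PreemptionGranularity and the
// scheduler's expectation of what the call will do for a given device.

#[derive(Clone, Copy, Debug, PartialEq)]
enum PreemptionGranularity { None, CommandBuffer, DrawDispatch, Instruction }

#[derive(Debug, PartialEq)]
enum PreemptBehavior {
    /// True hardware preemption (~50us to ~10ms, granularity-dependent).
    Hardware,
    /// Cooperative yield: stop feeding work at the next command buffer
    /// boundary; latency bounded by the longest in-flight buffer.
    CooperativeYield,
}

/// What the scheduler expects preempt_context() to do for this device.
fn expected_behavior(g: PreemptionGranularity) -> PreemptBehavior {
    match g {
        PreemptionGranularity::Instruction | PreemptionGranularity::DrawDispatch => {
            PreemptBehavior::Hardware
        }
        // CommandBuffer and None devices can only be yielded cooperatively.
        PreemptionGranularity::CommandBuffer | PreemptionGranularity::None => {
            PreemptBehavior::CooperativeYield
        }
    }
}
```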
42.2.5 AccelComputeVTable — Compute-Specific Extensions
For devices with programmable compute (GPUs, some NPUs):
#[repr(C)]
pub struct AccelComputeVTable {
pub vtable_size: u64,
pub version: u32,
/// Query supported compute APIs (Vulkan compute, OpenCL, CUDA compat).
pub get_compute_caps: unsafe extern "C" fn(
ctx: *mut c_void,
out_caps: *mut ComputeCapabilities,
) -> IoResultCode,
/// Set up a shared virtual address space between CPU and device.
/// Enables SVM (Shared Virtual Memory) / unified addressing.
pub enable_svm: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
process_page_table: u64, // CR3 / TTBR0 of the owning process
) -> IoResultCode>,
/// Handle a device-initiated page fault.
/// Called by the driver when the device faults on an unmapped address.
/// The kernel resolves the fault (allocate page, migrate, map) and
/// tells the driver to retry.
pub handle_device_fault: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
fault_addr: u64,
fault_flags: u32,
out_resolution: *mut FaultResolution,
) -> IoResultCode>,
/// Get performance counters for a context.
pub get_perf_counters: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
counters: *mut AccelPerfCounters,
) -> IoResultCode>,
}
/// Resolution of a device-initiated page fault, returned by handle_device_fault.
/// Tells the kernel how to proceed after a GPU/NPU fault on a shared virtual
/// memory address.
#[repr(C)]
pub struct FaultResolution {
/// Action the driver should take.
pub action: FaultAction,
/// If action is MapLocal or MapRemote, the physical page to map.
/// Ignored for other actions.
pub page_paddr: u64,
/// If action is MapLocal or MapRemote, the page table entry flags
/// (readable/writable/executable, caching attributes).
pub pte_flags: u64,
}
#[repr(u32)]
pub enum FaultAction {
/// Retry the faulting access — kernel resolved it (e.g., allocated page).
Retry = 0,
/// Map the provided physical page locally (device-local memory).
MapLocal = 1,
/// Map the provided physical page as remote (system memory, peer device).
MapRemote = 2,
/// Deliver SIGBUS or SIGSEGV to the faulting context's owner process.
Signal = 3,
/// Kill the context (unrecoverable device error).
KillContext = 4,
}
/// Performance counters for an accelerator context.
/// Returned by get_perf_counters() for monitoring and accounting.
#[repr(C)]
pub struct AccelPerfCounters {
/// Total GPU/accelerator cycles executed for this context.
pub cycles_executed: u64,
/// Total bytes read from device memory.
pub bytes_read: u64,
/// Total bytes written to device memory.
pub bytes_written: u64,
/// Number of kernels/shaders executed.
pub kernel_count: u64,
/// Number of page faults (SVM/Unified Memory).
pub page_faults: u64,
/// Time spent stalled on memory (nanoseconds).
pub memory_stall_ns: u64,
/// Time spent executing (nanoseconds).
pub execution_ns: u64,
/// Device-specific counters (vendor-defined).
pub vendor_counters: [u64; 8],
pub _pad: [u8; 32],
}
#[repr(C)]
pub struct ComputeCapabilities {
/// Shader/kernel ISA generation (vendor-specific version number).
pub isa_version: u32,
/// Maximum work group / thread block size.
pub max_workgroup_size: u32,
/// Maximum shared / local memory per work group (bytes).
pub max_local_memory: u32,
/// Number of shader/compute cores.
pub shader_cores: u32,
/// FLOPS (single precision, billions).
pub fp32_gflops: u32,
/// FLOPS (half precision, billions, 0 if unsupported).
pub fp16_gflops: u32,
/// INT8 TOPS (billions of 8-bit integer ops, for inference).
pub int8_tops: u32,
/// Tensor core / matrix unit count (0 if none).
pub tensor_cores: u32,
pub _pad: [u8; 32],
}
42.2.6 AccelDisplayVTable — Display-Capable Extensions
For devices with display output (GPUs with connectors). This vtable covers KMS mode-setting and framebuffer management; rendering/compositing acceleration is handled through AccelComputeVTable (compute shaders) or vendor-specific extensions. Advanced display features (HDR metadata, content-adaptive sync policies, display stream compression) are reserved for Phase 4 (Section 14-roadmap).
/// Display vtable for accelerators with display output.
/// Covers mode setting, framebuffer management, and hotplug.
#[repr(C)]
pub struct AccelDisplayVTable {
pub vtable_size: u64,
pub version: u32,
/// Enumerate display connectors on this device.
pub get_connectors: unsafe extern "C" fn(
ctx: *mut c_void,
out_connectors: *mut AccelConnectorInfo,
max_connectors: u32,
out_count: *mut u32,
) -> IoResultCode,
/// Get supported display modes for a connector.
pub get_modes: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_modes: *mut AccelDisplayMode,
max_modes: u32,
out_count: *mut u32,
) -> IoResultCode,
/// Set the active display mode for a connector.
pub set_mode: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
mode: *const AccelDisplayMode,
) -> IoResultCode,
/// Create a framebuffer object from device memory.
pub create_framebuffer: unsafe extern "C" fn(
ctx: *mut c_void,
mem: AccelMemHandle,
width: u32,
height: u32,
format: u32,
out_fb: *mut AccelFramebufferHandle,
) -> IoResultCode,
/// Schedule a page flip (swap front/back buffer).
pub page_flip: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
fb: AccelFramebufferHandle,
out_fence: *mut AccelFence,
) -> IoResultCode,
/// Set display power management state.
pub set_dpms_state: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
state: AccelDpmsState,
) -> IoResultCode,
/// Get hotplug status for a connector.
// KABI: u8 instead of bool for stable ABI (0=false, nonzero=true)
pub get_hotplug_status: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_connected: *mut u8,
) -> IoResultCode,
/// Read EDID data from a connector.
pub read_edid: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_edid: *mut u8,
max_size: u32,
out_size: *mut u32,
) -> IoResultCode,
/// Set hardware cursor image and position.
pub set_cursor: Option<unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
image: *const u8,
width: u32,
height: u32,
hot_x: u32,
hot_y: u32,
) -> IoResultCode>,
/// Enable/disable variable refresh rate (FreeSync/G-Sync).
pub set_vrr_enabled: Option<unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
enabled: u8,
) -> IoResultCode>,
}
/// Display connector information.
#[repr(C)]
pub struct AccelConnectorInfo {
/// Unique connector ID within this device.
pub connector_id: u32,
/// Connector type: HDMI=0, DisplayPort=1, DVI=2, VGA=3, eDP=4, Virtual=5.
pub connector_type: u16,
/// Maximum resolution width supported (0 if unknown).
pub max_width: u32,
/// Maximum resolution height supported (0 if unknown).
pub max_height: u32,
/// Currently connected (1) or disconnected (0).
pub connected: u8,
/// Currently active (has a mode set).
pub active: u8,
pub _pad: [u8; 6],
}
/// Display mode (resolution, refresh rate, timing).
#[repr(C)]
pub struct AccelDisplayMode {
/// Horizontal resolution in pixels.
pub width: u32,
/// Vertical resolution in lines.
pub height: u32,
/// Refresh rate in millihertz (e.g., 59950 for 59.95 Hz).
pub refresh_millihz: u32,
/// Clock frequency in kHz.
pub clock_khz: u32,
/// Horizontal front porch in pixels.
pub hfp: u16,
/// Horizontal sync pulse width in pixels.
pub hsw: u16,
/// Horizontal back porch in pixels.
pub hbp: u16,
/// Vertical front porch in lines.
pub vfp: u16,
/// Vertical sync pulse width in lines.
pub vsw: u16,
/// Vertical back porch in lines.
pub vbp: u16,
/// Preferred mode flag (monitor's preferred resolution).
pub preferred: u8,
pub _pad: [u8; 7],
}
/// Framebuffer handle (opaque).
#[repr(transparent)]
pub struct AccelFramebufferHandle(pub u64);
/// Command buffer handle (opaque, driver-allocated).
/// The driver allocates command buffer memory and returns this handle to the kernel
/// for tracking. The handle is used in `PendingSubmission` to identify which command
/// buffer to execute.
#[repr(transparent)]
pub struct AccelCmdBufferHandle(pub u64);
/// Semaphore handle (opaque, driver-allocated).
/// Used for cross-context synchronization. A submission can optionally signal a
/// semaphore on completion, allowing other contexts or userspace to wait on it.
#[repr(transparent)]
pub struct AccelSemaphoreHandle(pub u64);
/// Display power management state.
#[repr(u32)]
pub enum AccelDpmsState {
/// Display is on.
On = 0,
/// Display is in standby (minimal power, fast resume).
Standby = 1,
/// Display is suspended (low power, slower resume).
Suspend = 2,
/// Display is off.
Off = 3,
}
42.2.7 Scheduler Integration
The AccelScheduler interacts with the CPU scheduler to manage threads that are waiting for GPU work completion:
- GPU-blocked threads: Threads waiting on poll_completion or a completion callback are placed in interruptible sleep (the same state as I/O wait). They do not consume CPU time while waiting.
- Completion wakeup: When the completion notification callback fires (see register_completion_callback), the waiting thread is woken directly from the interrupt handler — no polling required.
- CPU vruntime accounting: Time spent blocked on GPU completion is treated like I/O wait: it does not accrue CPU debt. On wakeup, the thread receives a latency bonus similar to I/O-bound tasks in CFS, ensuring responsive scheduling for interactive GPU workloads.
- Realtime priority contexts: For AccelPriority::Realtime contexts, the completion interrupt is handled by a dedicated high-priority IRQ thread to minimize wakeup latency.
42.3 Integration with ISLE Architecture
42.3.1 Device Registry Integration
Accelerators are modeled in the device registry (Section 7) with rich sub-device structure:
pci0000:00
+-- 0000:41:00.0 (NVIDIA A100 GPU)
+-- Properties:
| device-class: "gpu"
| compute-units: 108
| local-memory-bytes: 85899345920 (80 GB HBM2e)
| tensor-cores: 432
| p2p-capable: true
| preemption: "instruction"
+-- Services published:
| "accel-compute" (AccelComputeVTable)
| "rdma-target" (for GPUDirect RDMA)
+-- Children:
+-- partition0 (MIG 3g.40gb) # A100-80GB supports up to 2x 3g.40gb
+-- partition1 (MIG 3g.40gb)
+-- nvenc0 (video encoder sub-device)
+-- nvdec0 (video decoder sub-device)
Example: NVIDIA RTX 4090 (desktop GPU with display output):
pci0000:00
+-- 0000:01:00.0 (NVIDIA RTX 4090)
+-- Properties:
| device-class: "gpu"
| compute-units: 128
| local-memory-bytes: 25769803776 (24GB GDDR6X)
| tensor-cores: 512
| p2p-capable: false
| preemption: "instruction"
+-- Services published:
| "accel-compute" (AccelComputeVTable)
| "accel-display" (AccelDisplayVTable)
+-- Children:
+-- nvenc0 (video encoder sub-device)
+-- nvdec0 (video decoder sub-device)
Power Management Integration:
On system suspend, the AccelScheduler drains in-flight submissions (with a timeout
matching the device registry's suspend timeout, Section 7.5.3). Pending queue entries
are preserved across suspend/resume and resubmitted after the device resumes. The
AccelScheduler reports idle/busy state to the device registry for runtime power
management decisions. AccelPowerState maps to the registry's PowerState as follows:
D0Active ↔ active, D1LowPower ↔ idle.
42.3.2 Crash Recovery
GPU driver crashes are currently catastrophic in Linux. In ISLE:
1. GPU driver (Tier 1, domain-isolated) faults.
2. ISLE Core detects the fault.
3. Device registry transitions GPU node to Recovering.
4. AccelScheduler:
a. Marks all active contexts as "interrupted."
b. Completes all pending submissions with error status.
c. Notifies all processes waiting on completions.
5. Registry orchestrates crash recovery (Section 7.10):
a. Revoke driver's isolation domain.
b. GPU device reset (PCIe FLR or driver-specific reset).
c. Reload driver binary.
d. Fresh KABI vtable exchange.
6. AccelScheduler recreates contexts for processes that are still running.
- Process state is on CPU side (command buffers). Only the GPU-side state is lost.
- Processes receive an error on their in-flight submissions and can retry.
7. Total recovery time: ~100-500ms (dominated by GPU reset).
Compare Linux: full system reboot (30-60 seconds), loss of all work.
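The numbered flow above can be condensed into an ordered step sequence. `RecoveryStep` and its variant names are illustrative condensations, not ISLE identifiers:

```rust
// Condensed sketch of the GPU crash recovery flow as an ordered step list.
// RecoveryStep and its variant names are illustrative, not ISLE identifiers.

#[derive(Clone, Copy, Debug, PartialEq)]
enum RecoveryStep {
    MarkContextsInterrupted,  // 4a
    FailPendingSubmissions,   // 4b: complete with error status
    NotifyWaiters,            // 4c
    RevokeIsolationDomain,    // 5a
    DeviceReset,              // 5b: PCIe FLR or driver-specific reset
    ReloadDriver,             // 5c
    VtableExchange,           // 5d: fresh KABI vtable exchange
    RecreateContexts,         // 6: for processes still running
}

/// Ordering matters: userspace-visible cleanup first, then the
/// registry-orchestrated driver restart, then context recreation.
fn recovery_sequence() -> Vec<RecoveryStep> {
    use RecoveryStep::*;
    vec![
        MarkContextsInterrupted, FailPendingSubmissions, NotifyWaiters,
        RevokeIsolationDomain, DeviceReset, ReloadDriver, VtableExchange,
        RecreateContexts,
    ]
}
```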
42.3.3 GPU Firmware as Cluster Member (Future)
Modern GPUs run sophisticated firmware that manages scheduling, memory, and P2P transfers. NVIDIA GPUs run proprietary GSP firmware; AMD GPUs load signed proprietary firmware blobs (PSP, SMU, MEC) even though the host driver is open-source. This firmware is effectively a specialized OS running on the GPU's control processor.
In ISLE's distributed kernel model (Section 47), GPU firmware can potentially participate as a first-class cluster member:
Current state (Phase 1-3):
- Host kernel controls GPU via driver (Tier 1)
- GPU firmware is passive — it responds to host commands but does not initiate
- All scheduling decisions, memory allocation, and work dispatch are host-driven
Future state (Phase 5-6):
- GPU firmware implements ISLE cluster membership protocol
- GPU registers as a cluster node with NodeId, exposes VRAM as remote-accessible
memory in the device registry
- GPU firmware participates in DSM (Distributed Shared Memory) and DLM (Distributed
Lock Manager) — allows GPU-initiated RDMA transfers, GPU-to-GPU locking without
CPU involvement
- Work can be scheduled across the cluster transparently: a cgroup's workload could
span CPUs on node0, GPUs on node0 and node1, and a DPU on node2
Benefits:
- GPU-initiated transfers: GPU firmware can directly request pages via DSM without waking the CPU driver, reducing latency for fine-grained data access patterns
- GPU-to-GPU coordination: Two GPUs on different nodes can synchronize via DLM without routing through their respective host CPUs — critical for distributed training with NCCL/RCCL
- Heterogeneous scheduling: The kernel scheduler sees CPUs and GPUs as a unified pool of compute resources, not separate domains
Requirements:
- GPU firmware must implement ISLE's inter-kernel messaging protocol (see Section 47.2.2 "Device-local kernels as cluster members" for the detailed protocol specification)
- Three implementation paths: (A) run full ISLE on the GPU's control processor, (B) a firmware shim translating ISLE messages to native GPU operations, or (C) a host-side proxy driver
- Requires firmware modifications by NVIDIA/AMD/Intel or open-source GPU firmware (e.g., Nouveau, AMDGPU firmware)
- The security model must prevent malicious GPU firmware from compromising host integrity — cluster membership is opt-in per device and requires signed firmware verification
This is not required for ISLE to function. It is an optional future enhancement that treats the "multikernel" model (one kernel per hardware domain) as a first-class design pattern rather than a hack. See Section 47.2.2 "Device-local kernels as cluster members" for SmartNIC/DPU equivalents.
42.3.4 FMA Integration
The FMA engine (Section 39) monitors accelerator health:
// Accelerator-specific health events
HealthEventClass::Accelerator // New class
// Health data for accelerators
#[repr(C)]
pub struct AccelHealthData {
/// GPU temperature (millidegrees Celsius).
pub temperature_mc: u32,
/// Power draw (milliwatts).
pub power_mw: u32,
/// ECC error count (VRAM).
pub ecc_correctable: u64,
pub ecc_uncorrectable: u64,
/// Thermal throttling events.
pub throttle_count: u32,
/// XID error code (NVIDIA) or equivalent.
pub error_code: u32,
/// PCIe replay count (indicates link instability).
pub pcie_replay_count: u32,
pub _pad: [u8; 20],
}
FMA diagnosis rules for accelerators:
| Rule | Threshold | Action |
|---|---|---|
| VRAM ECC degradation | 100 correctable / hour | Alert + schedule maintenance |
| VRAM uncorrectable error | 1 event | DisableDevice + Alert |
| Thermal throttling | 10 events / hour | Alert (may indicate cooling issue) |
| PCIe link unstable | 50 replays / minute | DemoteTier + Alert |
| Repeated driver crashes | 3 in 1 hour | DemoteTier (move to Tier 2) |
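A hypothetical sketch of how the FMA engine might evaluate the rate-based rules above (the repeated-crash rule is omitted). `AccelHealthSample` and `FmaAction` are assumed names; per-window rates are presumed to be precomputed from AccelHealthData counters:

```rust
// Hypothetical sketch of the rate-based diagnosis rules in the table above.
// AccelHealthSample and FmaAction are assumed names; rates are precomputed
// from AccelHealthData counters over the rule's time window.

#[derive(Debug, PartialEq)]
enum FmaAction { NoAction, Alert, DisableDevice, DemoteTier }

struct AccelHealthSample {
    ecc_correctable_per_hour: u64,
    ecc_uncorrectable: u64,
    throttle_events_per_hour: u32,
    pcie_replays_per_minute: u32,
}

fn diagnose(s: &AccelHealthSample) -> FmaAction {
    // Checked in severity order: an uncorrectable ECC error wins outright.
    if s.ecc_uncorrectable >= 1 {
        return FmaAction::DisableDevice;
    }
    if s.pcie_replays_per_minute >= 50 {
        return FmaAction::DemoteTier;
    }
    if s.ecc_correctable_per_hour >= 100 || s.throttle_events_per_hour >= 10 {
        return FmaAction::Alert;
    }
    FmaAction::NoAction
}
```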
42.3.5 Stable Tracepoints
New stable tracepoints for accelerator observability (Section 40):
| Tracepoint | Arguments | Description |
|---|---|---|
| isle_tp_stable_accel_submit | device_id, context, cmd_size, priority | Command submitted |
| isle_tp_stable_accel_complete | device_id, context, latency_ns, error | Command completed |
| isle_tp_stable_accel_preempt | device_id, preempted_ctx, preempting_ctx | Context preempted |
| isle_tp_stable_accel_migrate | device_id, direction, pages, bytes | Memory migration |
| isle_tp_stable_accel_fault | device_id, context, fault_addr | Device page fault |
| isle_tp_stable_accel_oom | device_id, requested, available | Device memory exhaustion |
| isle_tp_stable_accel_p2p | src_device, dst_device, bytes, latency_ns | P2P DMA transfer |
42.3.6 Object Namespace
Accelerators appear in the unified object namespace (Section 41):
\Devices\pci0000:00\0000:41:00.0 # GPU device object
\Accelerators\gpu0 # Symlink to device
\Accelerators\gpu0\Contexts\ # Active execution contexts
\Accelerators\gpu0\Memory\ # Memory allocation tracking
\Accelerators\gpu0\Partitions\ # MIG partitions
Browsable via islefs:
cat /mnt/isle/Accelerators/gpu0/Memory
# type: AcceleratorMemory
# total: 40000000000 (40 GB)
# used: 27000000000 (27 GB)
# contexts:
# ctx[pid=1234]: 16106127360 (15 GB) - training job
# ctx[pid=5678]: 8589934592 (8 GB) - inference server
# ctx[pid=9012]: 4294967296 (4 GB) - development
42.3.7 Partial Failure Handling
Single-context error: If a single execution context encounters an error (shader
hang, invalid command), the AccelScheduler calls reset_context() on the affected
context only. Other contexts on the same device continue unaffected. The affected
process receives an error status on its next poll_completion call or via the
completion callback.
ECC uncorrectable error: When device memory reports an uncorrectable ECC error,
the affected page is retired (analogous to CPU page retirement, Section 39.6). The
owning context is notified, and if possible, data is migrated to a healthy page. If
the page content is unrecoverable, the owning process receives SIGBUS.
Timeout without crash: The AccelScheduler enforces max_execution_us via a kernel
timer. When a submission exceeds its time limit:
- If the device supports instruction-level preemption: preempt the overdue context,
save its state, and return AccelCompletionStatus::Preempted.
- If preemption is not supported: wait until the current command buffer boundary, then
fail the overdue submission with AccelCompletionStatus::Timeout.
- In either case, the context remains valid and the process can submit new work.
42.3.8 Open Questions
The following items require further design work:
- Full /dev/isle-accel-N ioctl specification (number assignments, structure layouts, versioning scheme).
- Context save/restore mechanism for mid-shader preemption (how much state must be saved, where it is stored, latency budget for save/restore).
- Multi-GPU unified memory coherence granularity: should the coherence unit be 4KB (standard pages), 64KB (GPU-friendly), or 2MB (huge pages)? Trade-off between false sharing and transfer overhead.
- HealthEventClass::Accelerator formal taxonomy: define the complete set of health event types, severity levels, and recommended actions for accelerator hardware.
42.4 Implementation Phasing
| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| AccelBaseVTable KABI definition | Phase 3 | Driver SDK, device registry | Define the interface first |
| Accelerator scheduler (basic) | Phase 3-4 | AccelBase, cgroups | Round-robin, then priority |
| DRM/KMS compatibility shim | Phase 3-4 | AccelBase | Required for desktop GPU |
| Heterogeneous memory (basic) | Phase 4 | Memory manager, AccelBase | CPU-device migration |
| P2P DMA | Phase 4 | IOMMU, AccelBase | Requires IOMMU support |
| Cgroup accel controller | Phase 4 | AccelScheduler, cgroups | Compute time + memory limits |
| CBS bandwidth guarantees | Phase 4-5 | AccelScheduler, CBS | Guaranteed compute time |
| Hardware preemption | Phase 5 | AccelScheduler, GPU driver | Driver-dependent |
| In-kernel inference (basic) | Phase 4 | Decision trees only | Start with page prefetching |
| In-kernel inference (neural) | Phase 5 | Quantized INT8 runtime | Requires validation |
| RDMA KABI | Phase 5 | Network drivers | Separate from GPU |
| GPUDirect RDMA | Phase 5+ | RDMA + P2P DMA | Cross-driver |
| NVIDIA KABI driver (basic init) | Phase 3-4 | AccelBase, KABI, ioctl compat | nvidia-smi + simple CUDA (Section 46.2.3) |
| NVIDIA KABI driver (UVM) | Phase 4-5 | ISLE HMM, NVIDIA basic | cudaMallocManaged + multi-GPU (Section 46.2.3) |
| NVIDIA KABI driver (production) | Phase 5 | All NVIDIA components | Full CUDA stack + crash recovery |
| Device partitioning | Phase 5+ | AccelScheduler, registry | Hardware-dependent |
| Topology-assisted collectives | Phase 5+ | Registry, RDMA, P2P | Optimization layer |
Priority Rationale
Phase 3-4 (Real Workloads): Basic accelerator framework + DRM compat + simple scheduling. This is the minimum for "GPU works on ISLE."
Phase 4-5 (Production Ready): Cgroup integration, memory management, P2P DMA, in-kernel inference basics. This is when ISLE becomes better than Linux for AI/ML workloads.
Phase 5+ (Ecosystem): Advanced scheduling, RDMA, collectives, partitioning. This is the competitive advantage phase — features that Linux fundamentally cannot provide due to architectural constraints.
42.5 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| AccelBase KABI | Original design (vtable pattern from existing KABI) | None |
| Accelerator scheduler | CBS (academic, 1998) + priority scheduling (textbook) | None |
| Heterogeneous memory | Academic (HMM concepts, published research) | None |
| P2P DMA | PCIe spec (public), IOMMU spec (public) | None |
| DRM/KMS compat | Linux DRM (GPLv2 interface, ioctl numbers are facts) | None |
| In-kernel inference | Original design, standard ML algorithms (public) | None |
| RDMA | InfiniBand spec (public), RoCE spec (public) | None |
| Cgroup controller | Linux cgroup v2 interface (filesystem paths are facts) | None |
| NVIDIA KABI driver | New code inspired by MIT/GPLv2 open-source nvidia.ko | None (MIT-compatible, OKLF-clean) |
All vendor-specific GPU features (MIG, NVLink, etc.) are accessed through the AccelBase KABI vtable — the driver implements them, not the kernel. The kernel provides the framework; vendors fill in the hardware-specific logic through KABI.
43. Accelerator Memory and P2P DMA
43.1 Heterogeneous Memory Management
43.1.1 Problem
AI models are growing exponentially. A 70B parameter model at FP16 is ~140GB. GPU VRAM is typically 24-80GB. Models larger than VRAM need transparent memory management — pages migrating between CPU RAM and GPU VRAM based on access patterns.
Linux has HMM (Heterogeneous Memory Management) and mmu_notifiers, but they are
bolted onto the existing MM and poorly integrated. Each vendor (NVIDIA UVM, AMD SVM,
Intel SVM) implements its own page migration logic in the driver.
43.1.2 Design: Accelerator Memory as NUMA Nodes
The memory manager (Section 12) already has NUMA awareness — per-node buddy allocators, NUMA-local page allocation, NUMA-aware page cache placement.
Key insight: Accelerator memory is just another NUMA node. A GPU with 24GB VRAM is NUMA node N+1, with different performance characteristics (higher bandwidth, higher latency from CPU, not directly CPU-accessible).
Existing NUMA topology (Section 12):
NUMA Node 0 (CPU 0-15) NUMA Node 1 (CPU 16-31)
64GB DDR5 64GB DDR5
Per-CPU page caches Per-CPU page caches
Buddy allocator 0 Buddy allocator 1
Extended with accelerator memory:
NUMA Node 2 (GPU 0 VRAM) NUMA Node 3 (GPU 1 VRAM)
24GB GDDR6X 24GB GDDR6X
Managed by GPU driver Managed by GPU driver
via AccelBase vtable via AccelBase vtable
43.1.3 Memory Node Types
// isle-core/src/mem/numa.rs (extend existing)
pub enum NumaNodeType {
/// Standard CPU-attached DDR memory.
CpuMemory,
/// Accelerator device-local memory (VRAM, HBM).
/// Managed through the AccelBase KABI vtable.
AcceleratorMemory {
device_id: DeviceNodeId,
bandwidth_gbs: u32, // GB/s. e.g., 819 for HBM3 (per-stack, 6.4 Gbps/pin),
// 1229 for HBM3E (JEDEC max at 9.8 Gbps/pin;
// actual products: ~1000-1200 per stack)
latency_ns: u32, // e.g., 500-1000 for PCIe GPU
cpu_visible: bool, // Can CPU access this memory via BAR?
cpu_visible_size: u64, // Size of CPU-visible window (may be < total)
coherent: bool, // CPU-device cache coherent? (CXL = yes, PCIe = no)
},
/// CXL-attached memory (high-capacity, CPU-accessible, higher latency).
CxlMemory {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
},
/// CXL 3.0 shared memory pool visible to multiple nodes.
/// See Section 47.12 for CXL fabric integration details.
CxlSharedPool {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
/// Nodes that share this pool (bitfield).
sharing_nodes: u64,
/// Hardware coherence protocol version.
coherence_version: CxlCoherenceVersion,
},
}
43.1.4 Transparent Page Migration
When a device with hw_page_faults = 1 (modern GPUs with ATS/PRI support) faults
on an unmapped address, the following occurs:
1. Device encounters page fault at virtual address VA.
2. Device sends ATS (Address Translation Services) fault to IOMMU.
3. IOMMU routes fault to ISLE Core's device fault handler.
4. ISLE Core looks up VA in the owning process's VMA tree:
a. If VA maps to a CPU-resident page:
- Migrate the page from CPU NUMA node to device NUMA node
(via driver's migrate_pages callback)
- Update CPU page tables (unmap from CPU)
- Update device page tables (map on device)
- Resume device execution
b. If VA is unmapped:
- Allocate a fresh page on the device's NUMA node
- Map in device page tables
- Resume device execution
c. If VA maps to another device's memory:
- P2P migration (see Section 43)
5. Track this page's location in the unified page tracking structure.
For devices WITHOUT hardware page faults (older GPUs, most NPUs): the kernel uses software-assisted migration. Before submitting a command buffer, the scheduler identifies which pages the command buffer references and pre-migrates them. This is less efficient but functionally equivalent.
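Steps 4a-4c amount to a three-way plan selection keyed on the VMA lookup result. A condensed sketch with assumed names (`VmaLookup`, `FaultPlan`), not the real ISLE fault handler:

```rust
// Condensed sketch of steps 4a-4c: the VMA lookup result selects one of
// three resolution plans. VmaLookup and FaultPlan are assumed names.

#[derive(Debug, PartialEq)]
enum VmaLookup {
    CpuResident { node: u8 },    // VA maps to a CPU-resident page
    Unmapped,                    // VA has no backing page yet
    OtherDevice { device: u32 }, // VA maps to another device's memory
}

#[derive(Debug, PartialEq)]
enum FaultPlan {
    /// 4a: migrate the page from its CPU NUMA node to the device node,
    /// then update CPU and device page tables.
    MigrateFromCpu { src_node: u8 },
    /// 4b: allocate a fresh page on the device's NUMA node and map it.
    AllocateOnDevice,
    /// 4c: peer-to-peer migration from the other device (Section 43).
    P2pMigrate { src_device: u32 },
}

fn plan_fault_resolution(lookup: VmaLookup) -> FaultPlan {
    match lookup {
        VmaLookup::CpuResident { node } => FaultPlan::MigrateFromCpu { src_node: node },
        VmaLookup::Unmapped => FaultPlan::AllocateOnDevice,
        VmaLookup::OtherDevice { device } => FaultPlan::P2pMigrate { src_device: device },
    }
}
```

In every branch the final actions are the same: update the page location tracker (step 5) and resume device execution.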
43.1.5 Page Location Tracking
// isle-core/src/mem/hmm.rs (kernel-internal)
/// Tracks where each page of a shared address space currently resides.
pub struct PageLocationTracker {
/// Per-page location (indexed by virtual page number).
/// Each entry records which NUMA node currently holds this page.
locations: RadixTree<PageLocation>,
/// Per-page migration epoch counter (indexed by virtual page number).
/// Incremented each time the page transitions to `Migrating` state.
/// Used to detect stale migration completions after concurrent recovery.
/// Each entry is a single `AtomicU64`, stored in a separate radix tree
/// to avoid bloating the `PageLocation` enum (which is already 24 bytes).
migration_epochs: RadixTree<AtomicU64>,
/// Side table for in-flight migrations. Slab-allocated, bounded by
/// `MAX_CONCURRENT_MIGRATIONS` (default 4096). The `Migrating` variant
/// stores only the slab index (`migration_id: u32`), keeping the
/// per-page `PageLocation` entry small.
active_migrations: Slab<MigrationRecord>,
}
/// Full source/target metadata for an in-flight page migration.
/// Stored in `PageLocationTracker::active_migrations`, referenced by
/// `PageLocation::Migrating { migration_id }`.
pub struct MigrationRecord {
pub source_kind: PageLocationKind,
pub source_node: u8,
pub source_device: DeviceNodeId,
pub source_addr: u64,
pub target_kind: PageLocationKind,
pub target_node: u8,
pub target_device: DeviceNodeId,
pub target_addr: u64,
}
/// Stored per-page in `PageLocationTracker::locations` (one entry per virtual page).
/// `#[repr(u8)]` keeps the discriminant to one byte; the full variant data is in the
/// enum payload. The radix tree stores the enum inline, so every extra byte of
/// discriminant wastes space proportional to the mapped address-space size.
#[repr(u8)]
pub enum PageLocation {
/// Page is in CPU memory on this NUMA node.
CpuNode(u8),
/// Page is in accelerator device-local memory.
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64, // Device-side physical/virtual address
},
/// Page is in transit (being migrated).
/// Stores only a side-table index to avoid bloating every per-page entry.
/// The `MigrationRecord` (source, target, addresses) is stored in
/// `PageLocationTracker::active_migrations` — a slab of fixed capacity
/// bounded by `MAX_CONCURRENT_MIGRATIONS`. Migrations are transient
/// (milliseconds), so the number of in-flight entries is small relative
/// to total pages. This keeps `PageLocation` at 24 bytes (driven by
/// `RemoteNode` / `RemoteDevice`) instead of ~48 bytes.
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (demand-paged).
NotPresent,
/// Page is in compressed pool (Section 13).
Compressed,
/// Page is in swap.
Swapped,
// === Distributed memory locations (see Section 47.5 for the DSM protocol) ===
/// Page is on a remote node's CPU memory, accessible via RDMA.
RemoteNode {
node_id: NodeId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote node's accelerator memory (GPUDirect RDMA).
RemoteDevice {
node_id: NodeId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
/// See Section 47.12 for CXL fabric integration details.
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
/// Discriminant for `PageLocation` variants, used in `MigrationRecord`
/// to record the source/target kind of an in-flight migration.
#[repr(u8)]
pub enum PageLocationKind {
CpuNode = 0,
DeviceLocal = 1,
NotPresent = 2,
Compressed = 3,
Swapped = 4,
RemoteNode = 5,
RemoteDevice = 6,
CxlPool = 7,
}
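The epoch-guarded start/complete protocol implied by `migration_epochs` and `active_migrations` can be sketched in simplified form. This userspace stand-in uses a HashMap and Vec where the kernel uses radix trees and a slab; the `Tracker` and `Loc` names are illustrative:

```rust
// Simplified sketch of the epoch-guarded migration protocol. The kernel
// uses radix trees, AtomicU64 epochs, and a bounded slab; HashMap and Vec
// stand in here to keep the sketch runnable. Tracker/Loc are assumed names.

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Loc { CpuNode(u8), DeviceLocal { device: u32 }, Migrating { migration_id: u32 } }

struct Tracker {
    locations: HashMap<u64, Loc>,    // virtual page number -> location
    epochs: HashMap<u64, u64>,       // virtual page number -> migration epoch
    active: Vec<Option<(u64, Loc)>>, // migration_id -> (vpn, target location)
}

impl Tracker {
    /// Begin a migration: bump the page's epoch, record the target in the
    /// side table, and mark the per-page entry as Migrating.
    fn start_migration(&mut self, vpn: u64, target: Loc) -> (u32, u64) {
        let epoch = self.epochs.entry(vpn).or_insert(0);
        *epoch += 1; // lets stale completions be detected later
        let id = self.active.len() as u32;
        self.active.push(Some((vpn, target)));
        self.locations.insert(vpn, Loc::Migrating { migration_id: id });
        (id, *epoch)
    }

    /// Complete a migration. Returns false if the completion is stale,
    /// i.e. the page was re-migrated (e.g. after crash recovery) since
    /// this migration started.
    fn complete_migration(&mut self, id: u32, epoch_at_start: u64) -> bool {
        let (vpn, target) = self.active[id as usize].take().expect("unknown migration");
        if self.epochs.get(&vpn) != Some(&epoch_at_start) {
            return false; // stale: a newer migration owns this page
        }
        self.locations.insert(vpn, target);
        true
    }
}
```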
43.1.6 Migration Policy
// isle-core/src/mem/hmm.rs
pub struct MigrationPolicy {
/// Migrate on first device fault (aggressive, more migration traffic
/// but lower steady-state fault rate).
pub migrate_on_first_fault: bool,
/// Migrate only after N faults on the same page within a window
/// (conservative, less migration traffic, higher fault rate).
pub fault_threshold: u32,
pub fault_window_ms: u32,
/// Batch migration: when migrating, also prefetch nearby pages
/// (spatial locality heuristic).
pub prefetch_pages: u32, // 0 = no prefetch, 16 = prefetch 16 neighbors
/// Maximum pages in-flight (being migrated) at once.
pub max_inflight_migrations: u32,
/// Eviction policy when device memory is full:
/// LRU (least recently accessed on device) is the default.
pub eviction_policy: EvictionPolicy,
}
pub enum EvictionPolicy {
/// Evict least recently accessed pages back to CPU memory.
Lru,
/// Evict pages with lowest access frequency.
Lfu,
/// Use learned policy (in-kernel inference, Section 45).
Learned,
}
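The fault-threshold policy above can be sketched as a predicate over a page's recent fault timestamps. Field names mirror MigrationPolicy; the fault-history representation is an assumption:

```rust
// Sketch of the fault-threshold migration decision. The policy fields
// mirror MigrationPolicy above; representing the fault history as a
// timestamp slice is an assumption for illustration.

struct MigrationPolicy {
    migrate_on_first_fault: bool,
    fault_threshold: u32,
    fault_window_ms: u32,
}

/// `prior_fault_times_ms` holds timestamps of earlier faults on this page.
/// Returns true if the page should migrate on the current fault at `now_ms`.
fn should_migrate(policy: &MigrationPolicy, prior_fault_times_ms: &[u64], now_ms: u64) -> bool {
    if policy.migrate_on_first_fault {
        return true; // aggressive mode: migrate immediately
    }
    let window_start = now_ms.saturating_sub(policy.fault_window_ms as u64);
    let recent = prior_fault_times_ms.iter().filter(|&&t| t >= window_start).count() as u32;
    recent + 1 >= policy.fault_threshold // +1 counts the current fault
}
```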
43.1.7 Memory Oversubscription
When a model is larger than device VRAM:
Total model: 140GB (70B params, FP16)
GPU VRAM: 24GB
CPU RAM available: 256GB
Strategy:
1. Most-accessed layers (attention heads, active KV cache) → GPU VRAM
2. Less-accessed layers (early embedding, output projection) → CPU RAM
3. Migration on demand: when GPU needs a page in CPU RAM, migrate it
and evict a cold page from VRAM back to CPU RAM
The kernel manages this transparently. The ML framework sees a unified
140GB address space. Page migration is handled by the kernel's device
fault handler, exactly as CPU demand paging handles more virtual memory than
physical RAM.
This is the accelerator equivalent of virtual memory. The kernel has been managing memory oversubscription since the 1960s. Now it does it for GPU VRAM too.
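The hot/cold split in the strategy above can be sketched as a greedy placement pass over per-layer access counts (a simplified illustration; `place_layers` and its tuple layout are hypothetical, and the real kernel works at page granularity rather than layer granularity):

```rust
/// Greedy placement: sort layers by access frequency and fill VRAM with the
/// hottest layers; everything else stays in CPU RAM pending on-demand
/// migration. Returns (vram_layer_ids, cpu_layer_ids).
pub fn place_layers(
    layers: &[(u32, u64, u64)], // (layer_id, size_bytes, access_count)
    vram_bytes: u64,
) -> (Vec<u32>, Vec<u32>) {
    let mut sorted: Vec<_> = layers.to_vec();
    // Hottest layers first.
    sorted.sort_by(|a, b| b.2.cmp(&a.2));
    let (mut vram, mut cpu, mut used) = (Vec::new(), Vec::new(), 0u64);
    for (id, size, _) in sorted {
        if used + size <= vram_bytes {
            used += size;
            vram.push(id); // fits in device memory
        } else {
            cpu.push(id); // stays in CPU RAM, migrated on demand
        }
    }
    (vram, cpu)
}
```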
43.1.8 Huge Page Support for Tensors
ML tensors benefit from huge pages (fewer TLB misses, better DMA performance):
/// Page size selection for accelerator memory allocations.
/// This is a separate enum (not bitflags) because page sizes are mutually
/// exclusive — a single allocation uses exactly one page size.
#[repr(u32)]
pub enum AccelPageSize {
/// Standard 4KB pages.
Standard = 0,
/// 2MB huge pages.
HugePage2M = 1,
/// 1GB huge pages.
HugePage1G = 2,
/// Device-optimal page size (device driver selects the best size).
DeviceOptimal = 3,
}
/// Accelerator memory allocation modifier flags (OR-combinable).
/// Combined with a separate AccelPageSize field to fully describe
/// an allocation request.
bitflags::bitflags! {
#[repr(transparent)]
pub struct AccelMemFlags: u32 {
/// Non-migratable: pin in device memory, never evict to CPU.
const PINNED = 0x100;
/// CPU-visible: map into CPU address space via BAR.
const CPU_VISIBLE = 0x200;
/// Coherent: CPU-device coherent (requires CXL or similar).
const COHERENT = 0x400;
}
}
// Usage: page_size = AccelPageSize::HugePage2M,
// flags = AccelMemFlags::PINNED | AccelMemFlags::CPU_VISIBLE
43.1.9 Multi-Device Memory Coherence
When multiple GPUs access the same virtual address range (e.g., tensor parallelism across GPUs on the same machine), the kernel must manage coherence between device memories.
The PageLocation enum already includes DeviceLocal for single-device pages. For
read-only copies replicated across multiple GPUs, the kernel uses a single-owner
coherence protocol with explicit synchronization:
- Single-owner with read sharing: At any time, one device owns the writable
  copy. Other devices may hold read-only copies. This mirrors the DSM protocol
  in Section 47.5.
- Per-page ownership lock and wait queue: Each shared page has a per-page
  spinlock in the `PageLocationTracker` that serializes ownership transitions,
  plus a per-page wait queue for faulters that arrive during a migration. The
  spinlock protects only the page metadata (state, owner, reader list) — it is
  held for tens of nanoseconds, never across I/O. When a device faults on a
  page, the fault handler acquires this lock to inspect or modify the page's
  `PageLocation` state. If the page is in `Migrating` state, the faulter
  releases the spinlock and blocks on the per-page wait queue (not spinning)
  until the migration completes and wakes all waiters. The lock is per-page
  (not per-device or global) to avoid contention on unrelated pages.
- Explicit ownership transfer protocol: When a device writes to a page owned
  by another device:
  1. Fault handler acquires the per-page ownership spinlock.
  2. A `MigrationRecord` (source and target) is allocated in the side table
     and the page state transitions to `PageLocation::Migrating { migration_id }`.
     The `PageLocationEntry`'s `migration_epoch` counter is incremented. The
     migrator captures this epoch value before releasing the spinlock.
  3. Per-page ownership spinlock is released. The `Migrating` state prevents
     other faulters from modifying the page — they will see `Migrating` and
     block on the per-page wait queue until the migration completes.
  3a. Pre-revocation fence (NEW): Before revoking the source mapping (step 4),
     the kernel issues a device-specific quiesce command via
     `preempt_context()` on the source device's active context. This ensures
     the source device has completed all in-flight writes to the page before
     the DMA copy reads the data. The fence is verified via `AccelFence`
     completion before proceeding to step 5.
  4. The owning device's mapping is revoked (device page table unmap via
     driver callback). This is a slow operation (microseconds) and runs
     outside the spinlock.
  5. P2P DMA transfers the page data to the new owner (Section 43.2). This
     may take tens to hundreds of microseconds and also runs outside the
     spinlock.
  6. All devices holding read-only copies are invalidated via the
     AccelScheduler (which tracks all active contexts for a given address
     space). Each invalidation unmaps the page from the device's page table
     via the driver's callback.
  7. Fault handler re-acquires the per-page ownership spinlock. Stale
     migration detection: the migrator checks that `page.migration_epoch`
     still matches the epoch value captured at step 2. If the epoch has
     changed (because a crash recovery or concurrent migration reset the page
     state), the migration is stale — the migrator discards the DMA result
     and returns `EAGAIN` to trigger a fresh fault. This prevents stale
     migration completions from corrupting page state after concurrent
     recovery paths have already resolved the page.
  8. The new owner's mapping is installed and page state transitions to
     `DeviceLocal`.
  9. Per-page ownership spinlock is released.
  10. All waiters on the per-page wait queue are woken. They re-fault and see
      the new `DeviceLocal` state.
Steps 4-6 (revocation, DMA transfer, and reader invalidation) run outside the
spinlock, so they do not block other faulters on unrelated pages or cause
spin-time proportional to I/O latency. Faulters that arrive during steps 4-6
see Migrating state and sleep on the wait queue instead of spinning.
Steps 4-6 must complete before step 8 (new mapping). This is a write-invalidate
protocol — the same model used by CPU cache coherence (MOESI) and the DSM
protocol in Section 47.5, adapted for device memory.
`migration_epoch` field: Each page tracking entry in `PageLocationTracker`
includes a `migration_epoch: u64` counter. This counter is incremented at step 2
each time a page transitions to `Migrating`. The migrator captures the epoch
value after the increment and before releasing the spinlock. At step 7, after
re-acquiring the spinlock, the migrator compares the current `migration_epoch`
against the captured value. A mismatch indicates that a device crash + recovery
or a concurrent migration has intervened during the lock-free DMA window and has
already resolved (or restarted) the page migration independently. In that case
the DMA result is discarded and `EAGAIN` is returned; the faulting device will
re-fault and enter a fresh migration sequence.
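A minimal sketch of the epoch capture/compare cycle (the `PageEntry` type is a simplified stand-in for the real tracker entry, which also holds state, owner, and reader metadata):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified page entry: only the field the stale-migration check needs.
pub struct PageEntry {
    migration_epoch: AtomicU64,
}

impl PageEntry {
    pub fn new() -> Self {
        PageEntry { migration_epoch: AtomicU64::new(0) }
    }

    /// Step 2: bump the epoch and capture it (while holding the spinlock).
    pub fn begin_migration(&self) -> u64 {
        self.migration_epoch.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Step 7: after the lock-free DMA window, the migration is valid only
    /// if no recovery path or concurrent migration bumped the epoch meanwhile.
    pub fn commit_is_valid(&self, captured: u64) -> bool {
        self.migration_epoch.load(Ordering::SeqCst) == captured
    }
}
```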
- Migration error and timeout handling: Migration is a multi-step I/O operation
(device page table manipulation + DMA transfer) that can fail at any point.
Without explicit recovery, a page can be stuck permanently in
`Migrating` state, blocking all faulters on its wait queue indefinitely.
Timeouts: Every migration has a deadline: 100ms for intra-node transfers
(devices on the same PCIe root complex or NVLink fabric), 1s for cross-node
transfers (remote devices via RDMA or CXL). The fault handler starts a kernel
timer when it transitions the page to Migrating (step 2 above). If the timer
fires before the migration completes (step 8), the timeout handler initiates
rollback.
Migration result:
```rust
/// Outcome of a page migration attempt.
#[repr(u8)]
pub enum MigrationResult {
    /// Migration completed successfully. Page is now owned by the target device.
    Success,
    /// Migration failed (DMA timeout, device error, IOMMU fault). Page has been
    /// rolled back to its original location on the source device. Waiters are
    /// woken to re-fault and will find the page at its original location.
    RollbackToSource,
    /// Source device crashed or was reset during migration. The page data in
    /// device memory is lost. Page is marked `NotPresent` so the next fault
    /// reloads from backing store (CPU memory copy or disk).
    SourceLost,
}
```
Failure recovery by case:
- DMA timeout or destination device error (most common — e.g., P2P DMA times
  out, or the destination device returns an error during page write):
  1. Cancel or abandon the in-flight DMA transfer.
  2. The source device still holds the original page data (the physical copy
     is not freed until the transfer completes).
  3. Re-acquire the per-page ownership spinlock.
  4. Transition page state back from `Migrating` to its original `DeviceLocal`
     or `CpuNode` state (the `MigrationRecord` in the side table, looked up
     via the `Migrating` variant's `migration_id`, records exactly where to
     roll back to).
  5. Release the per-page ownership spinlock.
  6. Wake all waiters on the per-page wait queue. They re-fault and find the
     page at its original location.
  7. Increment `migration_failure_count` in the `PageLocationTracker` stats.
  8. Result: `MigrationResult::RollbackToSource`.
- Source device crash during migration (rare — the source device is reset or
  its driver crashes while the page is in `Migrating` state):
  1. The page data in device memory is physically lost (device reset destroys
     VRAM contents — see Section 42.3.2).
  2. The DMA transfer (if in progress) is aborted by the device reset.
  3. Re-acquire the per-page ownership spinlock.
  4. Transition page state to `PageLocation::NotPresent`.
  5. Release the per-page ownership spinlock.
  6. Wake all waiters on the per-page wait queue with an error indication.
     Waiters re-fault and trigger a fresh page-in from backing store: if the
     page has a CPU memory copy (e.g., it was migrated to the device from CPU
     RAM and the CPU copy was preserved, or the page is file-backed), the
     fault handler loads from the backing store. If no backing store exists
     (anonymous page that was only in device memory), the owning process
     receives `SIGBUS`.
  7. Increment `migration_failure_count` and `source_lost_count` in stats.
  8. Result: `MigrationResult::SourceLost`.
- Waiter notification on failure: Waiters on the per-page wait queue are woken
  with an error code (`MigrationFailed` or `SourceDeviceLost`) embedded in the
  wait queue wake event. On wakeup, each waiter re-enters the fault handler,
  re-acquires the per-page spinlock, and inspects the current page state. For
  `RollbackToSource`, the page is back at its original location and the waiter
  proceeds normally (read-share or initiate a fresh migration attempt). For
  `SourceLost`, the page is `NotPresent` and the waiter triggers a fresh fault
  resolution (page-in from backing store or `SIGBUS`).
- Repeated migration failures: If a page accumulates more than 3 migration
  failures within a 60-second window (tracked per-page in the
  `PageLocationTracker`), the page is marked as non-migratable (pinned at its
  current location) for a cooldown period (5 minutes). This prevents
  pathological retry loops on pages that consistently fail to migrate (e.g.,
  due to a flaky P2P DMA path).
- Read-sharing acquisition: When a device reads a page owned by another
  device, the fault handler acquires the per-page spinlock and checks the page
  state. If the page is in `Migrating` state, the faulter releases the
  spinlock and blocks on the per-page wait queue (same as for write faults).
  Otherwise, the faulter records the additional reader in the page metadata,
  releases the spinlock, then performs the P2P DMA copy to create a read-only
  replica on the requesting device (outside the spinlock). No ownership change
  occurs.
- Distributed cluster coherence: For multi-device coherence across machines
  (not just local GPUs), the per-page spinlock is replaced by the DSM
  directory protocol (Section 47.5, 12-distributed.md). The DSM protocol
  already provides single-owner / multi-reader semantics with a home-node
  directory that serializes ownership transitions across network boundaries —
  the same semantics as the local per-page spinlock, but designed for
  cross-node operation. Local multi-GPU coherence uses the lightweight
  per-page spinlock and wait queue; cross-machine coherence uses the DSM
  directory. The DLM (Section 31a) is for filesystem-level distributed
  locking, not page-level coherence. The write-invalidate coherence protocol
  is identical in both cases — only the ownership serialization mechanism
  differs (local spinlock vs. DSM directory).
Huge page requirement for multi-GPU coherence: The per-page coherence tracking
structures (GpuCoherenceEntry, ~64 bytes per page) mandate huge pages (2 MB) when
tracking >4 GPUs simultaneously. For a 4-GPU system with 256 GB VRAM total (64 GB per GPU,
16M pages per GPU), the coherence table requires ~1 GB. Using 4 KB pages for this
allocation would consume 256K page table entries and cause TLB thrashing during
coherence lookups. The allocator automatically promotes coherence table allocations
to 2 MB huge pages when gpu_count >= 4. On systems without huge page support, the
coherence table is capped at 4 GPUs; attempting to register a 5th GPU returns -ENOMEM.
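The sizing rule above can be expressed as a small helper (illustrative; assumes the stated 64-byte entry size and the `gpu_count >= 4` promotion threshold):

```rust
/// Assumed per-page coherence entry size (GpuCoherenceEntry, per the text).
pub const COHERENCE_ENTRY_BYTES: u64 = 64;

/// Total bytes of coherence table needed to track `tracked_pages` pages.
pub fn coherence_table_bytes(tracked_pages: u64) -> u64 {
    tracked_pages * COHERENCE_ENTRY_BYTES
}

/// Promotion rule from the text: allocate the table with 2 MB huge pages
/// once four or more GPUs are tracked.
pub fn use_huge_pages(gpu_count: u32) -> bool {
    gpu_count >= 4
}
```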
For the common case of gradient all-reduce (all GPUs read the same weights but write different gradients), the kernel keeps weights as shared-read on all devices and only migrates gradient pages on write. Since each GPU writes to its own gradient partition (non-overlapping address ranges), per-page locks are uncontended in the common case.
43.2 Peer-to-Peer DMA
43.2.1 Problem
AI workloads need data to flow between devices without CPU involvement:
- Storage → GPU: Load training data directly from NVMe to GPU VRAM (GPUDirect Storage)
- Network → GPU: Receive model weights from remote node directly to GPU (RDMA + GPUDirect)
- GPU → GPU: Gradient exchange between GPUs in multi-GPU training (NVLink, xGMI, PCIe P2P)
- GPU → Network: Send inference results directly from GPU to network (bypassing CPU copy)
Linux supports these as vendor-specific driver features. There is no general kernel mechanism.
43.2.2 Design: Generalized P2P DMA
Extend the KABI with a general mechanism for device-to-device DMA transfers:
// Appended to KernelServicesVTable
/// Set up a peer-to-peer DMA mapping between two devices.
/// The kernel programs the IOMMU to allow device A to DMA to device B's
/// memory region.
pub p2p_dma_map: Option<unsafe extern "C" fn(
src_device: DeviceHandle,
dst_device: DeviceHandle,
dst_mem: AccelMemHandle, // Or DmaBufferHandle for non-accel devices
dst_offset: u64,
size: u64,
out_iova: *mut u64, // IOVA that src_device can use to reach dst memory
out_mapping: *mut P2pMappingHandle,
) -> IoResultCode>,
/// Tear down a P2P DMA mapping.
pub p2p_dma_unmap: Option<unsafe extern "C" fn(
mapping: P2pMappingHandle,
) -> IoResultCode>,
/// Initiate a DMA transfer between two devices.
/// The kernel coordinates between the two drivers.
pub p2p_dma_transfer: Option<unsafe extern "C" fn(
mapping: P2pMappingHandle,
src_offset: u64,
dst_offset: u64,
size: u64,
out_fence: *mut AccelFence,
) -> IoResultCode>,
43.2.3 IOMMU Integration
P2P DMA requires careful IOMMU programming:
Without P2P:
Device A → IOMMU → CPU RAM ← IOMMU ← Device B
(two DMA transfers, CPU RAM as bounce buffer)
With P2P:
Device A → IOMMU → Device B's BAR / memory
(one DMA transfer, no CPU involvement)
The kernel validates:
1. Both devices support P2P (PCIe ACS, correct topology)
2. IOMMU can map across devices (same IOMMU domain or IOMMU supports cross-device mapping)
3. The requesting driver has the P2P_DMA capability
4. The target memory region is within the bounds of the P2P mapping
If hardware P2P is not possible (devices behind different root complexes without P2P bridge support), the kernel falls back to CPU-mediated copy transparently. The driver API is the same either way.
43.2.4 Topology-Aware Placement
The device registry (Section 7) provides topology information. The kernel uses this for P2P path selection:
PCIe Topology:
Root Complex 0
+-- Root Port 0
| +-- GPU 0 (VRAM: 24GB)
| +-- NVMe 0
+-- Root Port 1
| +-- GPU 1 (VRAM: 24GB)
| +-- NIC (100GbE)
P2P capability matrix:
GPU 0 ↔ NVMe 0: Direct P2P (same root port) — optimal
GPU 0 ↔ GPU 1: P2P via root complex — good
GPU 0 ↔ NIC: P2P via root complex — good
GPU 1 ↔ NVMe 0: P2P via root complex — good
The scheduler uses this topology when placing workloads: a training job that loads data from NVMe 0 is preferentially scheduled on GPU 0 (same root port = best P2P path).
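The capability matrix above reduces to a path-quality classification; here is a sketch under the assumption that each device's position can be summarized as a `(root_complex, root_port)` pair (the encoding and names are illustrative — the real kernel walks the device registry tree from Section 7):

```rust
/// Relative quality of a P2P path, derived from PCIe topology.
#[derive(Debug, PartialEq)]
pub enum P2pPath {
    /// Devices share a root port: direct P2P, best bandwidth/latency.
    SameRootPort,
    /// Devices share a root complex: P2P routed through the RC.
    SameRootComplex,
    /// No hardware P2P path: fall back to a CPU-mediated bounce copy.
    CpuBounce,
}

/// Classify the P2P path between two devices given their
/// (root_complex, root_port) coordinates.
pub fn classify_p2p(a: (u32, u32), b: (u32, u32)) -> P2pPath {
    if a == b {
        P2pPath::SameRootPort
    } else if a.0 == b.0 {
        P2pPath::SameRootComplex
    } else {
        P2pPath::CpuBounce
    }
}
```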
43.2.5 P2P Access Control
Peer-to-peer DMA allows devices to directly access each other's memory without CPU involvement. Without proper access control, a malicious or compromised device could read or corrupt another device's memory. This section specifies the mandatory access control mechanisms for all P2P DMA operations.
Authorization Model
Every P2P DMA mapping requires explicit authorization from both the source and target device owners:
P2P DMA Authorization Flow:
1. Client (process or driver) requests P2P mapping: src_device → dst_device memory
2. Kernel checks:
a. Client holds ACCEL_P2P capability (0x0102) for BOTH devices
b. Both devices are in the client's allowed device set (cgroup accel.devices)
c. Target memory region is within the client's allocated AccelMemHandle
d. Source device's cgroup has not exceeded accel.p2p.max limit (if set)
3. Kernel programs IOMMU with P2P mapping (never direct device-to-device without IOMMU)
4. P2pMappingHandle is bound to the requesting context — not transferable
5. Mapping is recorded in both devices' P2P ACL tables
Capability-Based Device ACLs
Each device maintains a P2P Access Control List (ACL) that specifies which other devices are authorized to initiate P2P DMA to its memory:
/// P2P ACL entry — one per authorized (src_device, dst_device) pair.
/// Stored in the device registry node for the target device.
#[repr(C)]
pub struct P2pAclEntry {
/// Source device that is authorized to initiate P2P DMA.
pub src_device_id: DeviceNodeId,
/// Capability token that authorized this ACL entry.
/// Must be valid for the ACL entry to be valid.
pub authorizing_cap: CapabilityToken,
/// Maximum bytes that can be transferred via this ACL entry.
/// Optional: 0 means unlimited (subject to cgroup limits).
pub max_bytes: u64,
/// Bytes transferred so far (reset on ACL refresh or revocation).
pub bytes_transferred: AtomicU64,
/// Creation timestamp (for auditing and expiration).
pub created_ns: u64,
/// Expiration timestamp (0 = no expiration).
pub expires_ns: u64,
}
The ACL is consulted on every p2p_dma_map call. The mapping is denied if:
- No ACL entry exists for (src_device, dst_device)
- The authorizing capability has been revoked
- The byte limit has been exceeded
- The ACL entry has expired
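The four denial conditions map directly onto a check function; this sketch uses a simplified non-atomic entry and folds the capability-revocation lookup into a boolean (`acl_allows` and its field names are illustrative, not the kernel's actual API):

```rust
/// Simplified ACL entry for illustration (the real entry uses AtomicU64
/// for bytes_transferred and a CapabilityToken lookup for revocation).
pub struct AclEntry {
    pub max_bytes: u64,         // 0 = unlimited
    pub bytes_transferred: u64,
    pub expires_ns: u64,        // 0 = no expiration
    pub cap_revoked: bool,      // stand-in for a capability-table lookup
}

/// Returns true if a p2p_dma_map request against this entry is allowed
/// at time `now_ns` for a transfer of `req_bytes`.
pub fn acl_allows(entry: Option<&AclEntry>, now_ns: u64, req_bytes: u64) -> bool {
    let e = match entry {
        None => return false, // no ACL entry for (src_device, dst_device)
        Some(e) => e,
    };
    if e.cap_revoked {
        return false; // authorizing capability has been revoked
    }
    if e.expires_ns != 0 && now_ns >= e.expires_ns {
        return false; // ACL entry has expired
    }
    if e.max_bytes != 0 && e.bytes_transferred + req_bytes > e.max_bytes {
        return false; // byte limit would be exceeded
    }
    true
}
```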
IOMMU Mediation
All P2P DMA transactions are mediated by the IOMMU. Direct device-to-device DMA without IOMMU translation is never permitted:
Security-Critical Design Decision:
P2P DMA Path: Device A → IOMMU → Device B's memory
NOT: Device A → Device B (direct PCIe P2P without IOMMU)
Rationale:
1. IOMMU provides address translation and access validation on every transaction
2. IOMMU can revoke access instantly by invalidating the IOVA mapping
3. IOMMU provides fault isolation if a device goes rogue
4. IOMMU enables per-transaction logging for security auditing
The IOMMU is programmed with per-device IOVA page tables. A P2P mapping creates an IOVA entry in the source device's page table that translates to the target device's physical BAR address. When the mapping is revoked, the IOVA entry is invalidated.
Revocation Protocol
P2P access must be revoked when:
- The authorizing capability is revoked (process exit, cgroup removal)
- A device is removed from the system (hot-unplug)
- A device is quarantined due to error or security event (Section 42.3.2)
- The cgroup limit is exceeded
- An explicit unmap request is made
Revocation is synchronous and blocking:
P2P Revocation Sequence:
1. Kernel receives revocation trigger (capability revoke, device quarantine, etc.)
2. Kernel looks up all P2P mappings involving the affected device(s)
3. For each active mapping:
a. Set mapping state to REVOKING
b. Issue IOMMU TLB invalidation for the IOVA range
c. Wait for in-flight transactions to complete (see below)
d. Remove the mapping from both devices' ACL tables
e. Free the P2pMappingHandle
4. Signal completion to revocation requester
In-Flight Transaction Handling
The critical challenge is handling P2P DMA transactions that are in-flight when revocation is requested. The kernel uses a quiesce-then-revoke protocol:
/// P2P mapping state machine.
#[repr(u8)]
pub enum P2pMappingState {
/// Mapping is active and can be used for new transfers.
Active = 0,
/// Mapping is being revoked. No new transfers allowed.
/// In-flight transactions are completing.
Revoking = 1,
/// Mapping is fully revoked. IOVA is invalid.
Revoked = 2,
}
/// In-flight transaction tracking.
/// Each P2P mapping tracks pending transfers.
pub struct P2pMapping {
pub handle: P2pMappingHandle,
pub src_device: DeviceNodeId,
pub dst_device: DeviceNodeId,
pub state: AtomicU8, // P2pMappingState
pub iova_base: u64,
pub iova_size: u64,
/// Number of in-flight transfers using this mapping.
/// Incremented before p2p_dma_transfer, decremented after fence completion.
pub in_flight_count: AtomicU32,
/// Wait queue for revocation to wait on in-flight completion.
pub revoke_wait_queue: WaitQueue,
}
Revocation waits for in-flight transactions:
In-Flight Handling During Revocation:
1. Set state to REVOKING (atomic store)
2. New p2p_dma_transfer calls fail with EREVOKED
3. Wait for in_flight_count to reach zero:
- Each transfer increments in_flight_count before submission
- Each transfer decrements after fence signals completion
- Timeout: 100ms for local P2P, 1s for cross-node RDMA
4. If timeout expires:
a. Issue device quiesce via driver callback (preempt_context)
b. Force IOMMU TLB invalidation
c. Log warning: "P2P revocation timed out, forced quiesce"
d. Proceed with revocation (data may be lost or corrupted)
5. Set state to REVOKED
6. Free mapping resources
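The in-flight counting protocol can be sketched with atomics (a simplified single-mapping model; the real kernel additionally parks the revoker on `revoke_wait_queue` and applies the timeout/force-quiesce path):

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const ACTIVE: u8 = 0;
const REVOKING: u8 = 1;

pub struct Mapping {
    state: AtomicU8,       // P2pMappingState
    in_flight: AtomicU32,  // pending transfers using this mapping
}

impl Mapping {
    pub fn new() -> Self {
        Mapping { state: AtomicU8::new(ACTIVE), in_flight: AtomicU32::new(0) }
    }

    /// Called before submitting a transfer. Returns false (EREVOKED in the
    /// real API) once revocation has started.
    pub fn try_begin_transfer(&self) -> bool {
        if self.state.load(Ordering::Acquire) != ACTIVE {
            return false;
        }
        self.in_flight.fetch_add(1, Ordering::AcqRel);
        // Re-check: a revoke may have raced with the increment.
        if self.state.load(Ordering::Acquire) != ACTIVE {
            self.in_flight.fetch_sub(1, Ordering::AcqRel);
            return false;
        }
        true
    }

    /// Called when the transfer's fence signals completion.
    pub fn end_transfer(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    /// Step 1 of revocation: flip to REVOKING; the caller then waits until
    /// the mapping is quiescent (or the timeout forces a quiesce).
    pub fn start_revoke(&self) {
        self.state.store(REVOKING, Ordering::Release);
    }

    pub fn quiescent(&self) -> bool {
        self.in_flight.load(Ordering::Acquire) == 0
    }
}
```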
P2P Memory Accounting
All P2P DMA memory usage is accounted to the originating cgroup:
/sys/fs/cgroup/<group>/accel.p2p.current
# Current bytes transferred via P2P DMA (read-only)
# Sum across all devices this cgroup can access
/sys/fs/cgroup/<group>/accel.p2p.max
# Maximum bytes that can be transferred via P2P in the current period
# Format: "bytes period_us"
# Example: "10737418240 1000000" (10GB per second)
# Default: "max" (unlimited)
/sys/fs/cgroup/<group>/accel.p2p.stat
# P2P statistics (read-only):
# total_bytes <cumulative bytes transferred>
# mappings <current active P2P mappings>
# revocations <times P2P was revoked>
# timeout_revocations <revocations that timed out waiting for in-flight>
The byte counter is incremented on each p2p_dma_transfer completion (fence signaled).
If a transfer fails or is aborted, the bytes are not counted.
Device Quarantine and P2P
When a device is quarantined (Section 42.3.2), all P2P mappings involving that
device are immediately revoked using the protocol above. This prevents a
quarantined device from:
- Initiating new P2P DMA to other devices
- Receiving P2P DMA from other devices
- Corrupting or exfiltrating data via P2P paths
The quarantine process blocks until all P2P revocations complete. If any revocation times out (in-flight transactions do not complete), the device is force-reset via PCIe FLR or driver-specific reset, which guarantees termination of all DMA activity.
Security Audit Trail
All P2P authorization and revocation events are logged to the kernel audit subsystem:
Audit Events (logged to /sys/kernel/debug/isle/audit.log):
P2P_MAP: src_device=<id> dst_device=<id> size=<bytes> cap=<token> cgroup=<path>
P2P_UNMAP: mapping=<handle> reason=<explicit|cap_revoke|cgroup_exit|quarantine>
P2P_REVOKE: mapping=<handle> src_device=<id> dst_device=<id> in_flight=<count> timed_out=<bool>
P2P_ACL_ADD: src_device=<id> dst_device=<id> cap=<token> max_bytes=<bytes>
P2P_ACL_REMOVE: src_device=<id> dst_device=<id> reason=<cap_revoke|expired|quarantine>
These audit events are available for security monitoring and forensics.
44. Accelerator Isolation and Scheduling
44.1 Capability-Based Access Control
Every accelerator context is gated by the ISLE capability system:
// Extend existing cap_id constants in isle-driver-sdk/src/capability.rs
pub const ACCEL_COMPUTE: u32 = 0x0100; // Submit compute work
pub const ACCEL_MEMORY: u32 = 0x0101; // Allocate device memory
pub const ACCEL_P2P: u32 = 0x0102; // P2P DMA transfers
pub const ACCEL_PREEMPT: u32 = 0x0103; // Preempt other contexts (admin)
pub const ACCEL_PERF: u32 = 0x0104; // Read performance counters
pub const ACCEL_POWER: u32 = 0x0105; // Change power/clock state (admin)
pub const ACCEL_CONTEXT_RT: u32 = 0x0106; // Create realtime-priority contexts
A container gets a capability token that says "50% compute time on GPU 0, 8GB VRAM limit." The kernel enforces this through the AccelScheduler and memory accounting.
44.2 Cgroup Integration
New cgroup controller: accel.
/sys/fs/cgroup/<group>/accel.devices
# Which accelerators this cgroup can access (by device index)
# Format: "0 1 3" (devices 0, 1, 3 allowed)
/sys/fs/cgroup/<group>/accel.memory.max
# Maximum total device memory across all accelerators (bytes)
# Format: "8589934592" (8GB)
/sys/fs/cgroup/<group>/accel.memory.current
# Current device memory usage (read-only)
/sys/fs/cgroup/<group>/accel.compute.guarantee
# Guaranteed compute bandwidth (microseconds per second, per device)
# Format: "device_idx quota period"
# Example: "0 500000 1000000" (50% of GPU 0 guaranteed)
/sys/fs/cgroup/<group>/accel.compute.max
# Maximum compute bandwidth ceiling
# Format: same as guarantee
/sys/fs/cgroup/<group>/accel.compute.weight
# Relative share of excess compute time (like cpu.weight)
# Default: 100
/sys/fs/cgroup/<group>/accel.priority
# Default priority for contexts created by this cgroup
# "background", "normal", "high"
# ("realtime" requires ACCEL_CONTEXT_RT capability)
/sys/fs/cgroup/<group>/accel.stat
# Usage statistics (read-only):
# compute_time_us <total compute microseconds>
# memory_current <current bytes>
# memory_peak <peak bytes>
# submissions <total command submissions>
# preemptions <times preempted by higher priority>
# faults <device page faults>
# migrations <pages migrated>
44.3 Memory Isolation
Each AccelContext has its own device-side page table (or partition of the device's address space). The kernel ensures:
- Context A cannot access Context B's device memory (separate page tables / address spaces).
- Context A cannot exceed its `accel.memory.max` cgroup limit.
- When Context A is destroyed, all its device memory is freed immediately.
- The OOM killer is aware of device memory. If a process is consuming excessive device memory and the device is under pressure, the OOM killer can target it.
Device Memory and the OOM Killer:
Device memory is additive to a process's OOM score — device_bytes is counted as an
RSS equivalent. Before the OOM killer selects a victim, the kernel attempts soft
reclaim: evicting device pages back to CPU RAM (using the migrate_pages vtable call).
Per-device memory pressure is tracked with high/low watermarks; when the high watermark
is breached, the AccelScheduler proactively triggers eviction of cold pages from
device memory to CPU RAM. If a process exceeds its cgroup accel.memory.max limit,
the allocation returns ENOMEM rather than triggering OOM kill — the process must
handle the allocation failure gracefully.
44.4 Compute Time Isolation
The AccelScheduler enforces compute time limits per context:
- `accel.compute.max`: Hard ceiling. If a context has used its quota for this
  period, its submissions are queued until the next period. (Same semantics as
  `cpu.max`.)
- `accel.compute.guarantee`: Minimum bandwidth via CBS server. (Same algorithm
  as `cpu.guarantee` from Section 15.)
- `accel.compute.weight`: Proportional sharing of compute time not covered by
  guarantees. (Same semantics as `cpu.weight`.)
- Preemption: If a high-priority context's CBS server becomes runnable, the
  scheduler preempts the current context (if hardware supports it). Otherwise,
  the current submission runs to completion and the high-priority context gets
  the next slot.
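The `accel.compute.max` ceiling behaves like `cpu.max`-style bandwidth control; a minimal quota-refill sketch under that assumption (`ComputeBudget` and its method names are illustrative, not the kernel's actual types):

```rust
/// Per-context compute budget, refilled each period (cpu.max-style).
pub struct ComputeBudget {
    quota_us: u64,        // allowed compute microseconds per period
    period_us: u64,
    used_us: u64,         // consumed in the current period
    period_start_us: u64,
}

impl ComputeBudget {
    pub fn new(quota_us: u64, period_us: u64) -> Self {
        ComputeBudget { quota_us, period_us, used_us: 0, period_start_us: 0 }
    }

    /// Returns true if a submission estimated at `est_us` may run now;
    /// otherwise it is queued until the next period.
    pub fn try_charge(&mut self, now_us: u64, est_us: u64) -> bool {
        if now_us - self.period_start_us >= self.period_us {
            // New period: advance the window start and refill the quota.
            self.period_start_us =
                now_us - (now_us - self.period_start_us) % self.period_us;
            self.used_us = 0;
        }
        if self.used_us + est_us > self.quota_us {
            return false; // over the ceiling: queue until next period
        }
        self.used_us += est_us;
        true
    }
}
```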
44.5 Device Partitioning
For hardware that supports partitioning (NVIDIA MIG, AMD spatial partitioning):
/// Device partition descriptor.
#[repr(C)]
pub struct AccelPartition {
/// Partition index.
pub index: u32,
/// Compute units assigned to this partition (same semantics as
/// `AccelDeviceInfo::compute_units` — see Section 42.2.3).
pub compute_units: u32,
/// Memory assigned to this partition (bytes).
pub memory_bytes: u64,
/// Unique partition ID (for cgroup binding).
pub partition_id: u64,
}
The device registry models partitions as child nodes of the GPU device node:
pci0000:00
+-- 0000:41:00.0 (GPU, NVIDIA A100)
+-- partition0 (MIG 2g.20gb: 28 SMs, 20GB)
+-- partition1 (MIG 2g.20gb: 28 SMs, 20GB)
+-- partition2 (MIG 3g.40gb: 42 SMs, 40GB)
    (108 total SMs; 10 reserved for system/L2 cache management; 98 available for MIG partitions)
Each partition is an independently schedulable and isolatable unit. Cgroups can
be bound to specific partitions: `echo "0000:41:00.0/partition0" > accel.devices`.
44.6 GPU Virtualization Modes
GPUs support multiple virtualization approaches. The kernel accommodates all four without imposing one model.
Note: VFIO passthrough (mode 1) and SR-IOV (mode 2) are general-purpose PCIe virtualization technologies — widely used for NICs, NVMe controllers, FPGAs, and other devices. The general IOMMU/VFIO mechanism is described in Section 7.3.8. This section describes their specific application to GPUs and accelerators. Modes 3 (mdev) and 4 (MIG) are GPU/accelerator-specific.
1. PCIe Passthrough (VFIO):
Entire GPU assigned to a single VM via IOMMU.
Kernel role: IOMMU group management, VFIO device file (/dev/vfio/N).
Performance: near-native. No sharing between VMs.
Use case: dedicated GPU per VM (HPC, gaming).
(This is the same VFIO mechanism used for NIC passthrough, NVMe
passthrough, etc. — the interface is device-agnostic.)
2. SR-IOV (Single Root I/O Virtualization):
Hardware creates Virtual Functions (VFs), each a separate PCIe
device with its own BARs and IOMMU mapping.
Kernel role: enumerate VFs, create device registry nodes for each,
assign VFs to VMs via VFIO. AccelScheduler is NOT involved — each VF
is an independent device scheduled by its own firmware.
GPU support: Intel Data Center GPU Max, some AMD MI-series.
Not supported by: NVIDIA consumer GPUs, most AMD consumer GPUs.
(SR-IOV is more common for NICs — e.g., Intel E810, Mellanox
ConnectX — where each VF provides an independent network interface.
The kernel handles GPU VFs and NIC VFs identically at the PCIe level.)
3. Mediated Passthrough (mdev / vGPU):
Software-defined GPU partitions, managed by the GPU driver.
NVIDIA vGPU (GRID) and Intel GVT-g use this model.
(AMD MxGPU uses SR-IOV, not mdev — see #2 above.)
Kernel role:
- mdev framework: /sys/bus/mdev/ device lifecycle (same as Linux).
- Each mdev appears as a VFIO device to the VM (same API as #1).
- The GPU driver (Tier 1) handles the actual time-slicing and
memory partitioning internally.
- AccelScheduler can enforce per-mdev cgroup limits if the driver
exposes per-mdev utilization via get_utilization().
Trade-off: more flexible than SR-IOV, but relies on driver quality.
4. MIG (Multi-Instance GPU) — NVIDIA A100/H100:
Hardware partitioning into isolated GPU instances, each with
dedicated SMs, memory controllers, and L2 cache.
Modeled as child nodes in device registry (§44.5 above).
Can be combined with VFIO: each MIG instance can be passed through
to a separate VM.
The KABI AccelBase vtable supports all four modes — the kernel sees devices (physical, SR-IOV VF, MIG partition, or mdev) through the same interface. The virtualization mode is a configuration choice, not an architectural difference.
45. In-Kernel Inference Engine
45.1 Rationale
The kernel makes millions of decisions per second: which page to evict, which I/O to schedule next, which task to migrate, whether a health metric is anomalous. Today these decisions use hand-tuned heuristics. Machine learning can do better for pattern-dependent decisions.
45.2 Constraints
In-kernel inference is not general-purpose ML. It must satisfy:
- Deterministic execution time: Every inference call must complete within a
  bounded number of cycles. No data-dependent loops, no dynamic allocation.
  For multi-layer models (`TinyNeuralNet`), inference is preemptible at layer
  boundaries: the inference loop checks `need_resched()` between layers and
  yields if a higher-priority task or interrupt is pending. This ensures CPU
  inference (which may run for ~500ns-5μs) does not cause scheduling latency
  spikes on latency-sensitive systems.
- Bounded memory: Model size is fixed at load time. No dynamic allocation during inference.
- Safe fallback: If the model produces nonsensical output, the system falls back to a traditional heuristic. The model is advisory, never authoritative for safety-critical decisions.
- Offline training: Models are trained in userspace (with full floating-point, GPU acceleration, unlimited time). Only the trained, quantized model is loaded into the kernel.
45.3 Supported Model Types
// isle-core/src/inference/mod.rs (kernel-internal)
pub enum KernelModelType {
/// Decision tree / random forest.
/// Bounded depth (max 32), bounded node count (max 64K).
/// Inference = walk tree from root to leaf, O(depth) comparisons.
/// Best for: classification decisions with <50 features.
DecisionTree,
/// Quantized lookup table.
/// Input is quantized to N bits, output is a table lookup.
/// O(1) inference. Best for: 1-2 dimensional functions.
LookupTable,
/// Quantized linear model.
/// weights: [i16; N], bias: i32, threshold: i32.
/// Inference = dot product + compare. O(N) multiplications.
/// Best for: simple binary classification / regression.
LinearModel,
/// Quantized tiny neural network.
/// Fixed architecture: input → hidden(s) → output.
/// All weights INT8, activations INT8, accumulation INT32.
/// Maximum: 4 layers, 256 neurons per layer.
/// Inference time bounded by architecture constants.
TinyNeuralNet,
}
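To make the integer-only constraint concrete, here is a minimal userspace sketch of the LinearModel variant — INT16 weights, INT32 saturating accumulation, threshold compare. The names (QLinearModel, infer) are illustrative, not the kernel's actual types:

```rust
/// Illustrative quantized linear model: INT16 weights, INT32 accumulator.
/// No floating point anywhere, matching the kernel's no-FPU constraint.
pub struct QLinearModel {
    pub weights: Vec<i16>, // fixed-size and immutable at load time in the kernel
    pub bias: i32,
    pub threshold: i32,
}

impl QLinearModel {
    /// Dot product + bias, compared against the threshold.
    /// O(N) multiplications, no data-dependent loops, no allocation.
    pub fn infer(&self, input: &[i16]) -> bool {
        debug_assert_eq!(input.len(), self.weights.len());
        // Saturating accumulation: a pathological input cannot overflow
        // into undefined behavior, only clamp to i32::MIN/MAX.
        let acc = self
            .weights
            .iter()
            .zip(input)
            .fold(self.bias, |a, (&w, &x)| a.saturating_add(w as i32 * x as i32));
        acc >= self.threshold
    }
}
```

The same shape extends to TinyNeuralNet: per-layer INT8 weights with INT32 accumulators, and a need_resched() check between layer loops.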
45.4 Model Loading and Lifecycle
Models are trained in userspace and loaded into the kernel via a sysfs interface:
/sys/kernel/isle/inference/models/
page_prefetch/
model.bin # Write: load model binary. Read: model metadata.
active # "1" to enable, "0" to disable, "heuristic" to fallback
accuracy # Read: online accuracy estimate (correct predictions / total)
latency_ns # Read: average inference latency
invocations # Read: total inference calls
fallbacks # Read: times fell back to heuristic
io_scheduler/
model.bin
active
accuracy
latency_ns
...
fma_anomaly/
model.bin
active
...
// isle-core/src/inference/model.rs (kernel-internal)
pub struct KernelModel {
/// Model type determines the inference algorithm.
pub model_type: KernelModelType,
/// Model parameters (weights, tree nodes, lookup table).
/// Allocated as a single contiguous block, fixed at load time.
pub params: &'static [u8],
/// Input feature count.
pub input_features: u32,
/// Output count (1 for regression, N for N-class classification).
pub outputs: u32,
/// Maximum inference latency in nanoseconds (pre-computed from model size).
pub max_latency_ns: u64,
/// Whether this model is currently active.
pub active: AtomicBool,
/// Online accuracy tracking.
pub stats: ModelStats,
}
pub struct ModelStats {
pub total_invocations: AtomicU64,
pub correct_predictions: AtomicU64, // When ground truth is available
pub total_latency_ns: AtomicU64,
pub fallback_count: AtomicU64,
}
45.5 Use Cases
Use case 1: Learned Page Prefetching
Replace Linux's simple sequential readahead with a learned prefetcher:
Input features (per page fault):
- Faulting virtual address (quantized to page range)
- Previous N fault addresses (pattern detector)
- Process ID (different processes have different patterns)
- Time since last fault
- Memory region type (heap, stack, mmap, file-backed)
Output:
- Next K pages to prefetch (page offsets relative to current fault)
Model type: TinyNeuralNet (2 hidden layers, 128 neurons each, INT8)
Inference time: ~500ns
Benefit: 20-40% reduction in page fault rate for workloads with learnable patterns
Fallback: Standard sequential readahead (Linux default)
Ground truth for online accuracy: track whether prefetched pages are actually accessed within a time window. If accuracy drops below threshold, fall back to heuristic.
Use case 2: I/O Scheduling
Optimize I/O queue ordering based on learned workload patterns:
Input features (per I/O request):
- LBA (sector address, quantized)
- Request size
- Read vs write
- Queue depth
- Device utilization
- Recent I/O pattern (last N requests, summarized)
Output:
- Priority score (determines queue ordering)
Model type: DecisionTree (depth 16, ~4K nodes)
Inference time: ~200ns
Benefit: 5-15% IOPS improvement for mixed workloads
Fallback: mq-deadline heuristic
Use case 3: FMA Anomaly Detection
Detect correlated hardware degradation that threshold rules miss:
Input features (per health telemetry window):
- ECC error rate (current window)
- ECC error rate (historical baseline)
- Temperature delta from baseline
- PCIe correctable error rate
- SMART attribute trends (multiple attributes)
- Device age / power-on hours
Output:
- Anomaly score (0.0 - 1.0 in fixed-point)
- Predicted failure class (memory, storage, bus, thermal)
Model type: DecisionTree (depth 20, ~8K nodes)
Inference time: ~300ns
Benefit: Detect multi-signal degradation patterns that simple thresholds miss
Fallback: Threshold rules (Section 39.5)
Use case 4: Accelerator Memory Migration Policy
Decide which pages to migrate between CPU and GPU memory:
Input features (per page):
- Access count on device (recent window)
- Access count on CPU (recent window)
- Time since last access
- Page size
- Current device memory pressure
Output:
- Migrate to device / keep on CPU / evict from device
Model type: LinearModel (fast, simple)
Inference time: ~50ns per page
Benefit: Better migration decisions than fixed LRU
Fallback: LRU eviction
45.6 Safety Guarantees
The primary safety mechanism is mandatory structural validation at model load time (Section 45.7). The load-time validator statically proves that every accepted model terminates in bounded time before it is ever invoked:
- Decision trees: Verify tree depth <= max_depth and that the tree is acyclic
  (DAG check). A tree of bounded depth with no cycles has a statically known
  maximum number of comparisons per inference.
- Linear models: Verify the number of input features and outputs matches the
  declared dimensions. Inference is a single matrix-vector multiply with bounded
  operations.
- Neural networks: Verify the layer count is fixed (no recurrent connections that
  could loop), and that the total multiply-accumulate (MAC) operations <= max_ops.
  A feedforward network with bounded layers and bounded dimensions has a statically
  known operation count.
Any model that cannot be statically proven to terminate in bounded time is rejected at load time and never reaches the inference path. This is the preemptive guarantee.
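The decision-tree branch of this structural proof can be sketched as follows. The node layout (child indices, with a sentinel for leaves) and the forward-pointer acyclicity rule are illustrative simplifications of the validator described above:

```rust
/// Sentinel child index marking a leaf.
const NONE: u32 = u32::MAX;

/// Illustrative tree node: a node is a leaf if both children are NONE.
pub struct TreeNode {
    pub left: u32,
    pub right: u32,
}

/// Statically prove bounded execution: child indices may only point
/// "forward" (index strictly greater than the parent's), which rules
/// out cycles by construction, and every root-to-leaf path must fit
/// within max_depth comparisons.
pub fn validate_tree(nodes: &[TreeNode], max_depth: u32) -> bool {
    if nodes.is_empty() {
        return false;
    }
    // Forward-pointer check: child index > parent index => acyclic.
    for (i, n) in nodes.iter().enumerate() {
        for &c in &[n.left, n.right] {
            if c != NONE && (c as usize <= i || c as usize >= nodes.len()) {
                return false;
            }
        }
    }
    // Depth check with an explicit stack (no recursion in kernel code).
    let mut stack = vec![(0u32, 0u32)]; // (node index, depth)
    while let Some((idx, depth)) = stack.pop() {
        if depth > max_depth {
            return false;
        }
        let n = &nodes[idx as usize];
        for &c in &[n.left, n.right] {
            if c != NONE {
                stack.push((c, depth + 1));
            }
        }
    }
    true
}
```

A model accepted by this check can execute at most max_depth comparisons per inference, which is exactly the bound the latency budget computation needs.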
The post-hoc cycle check in infer_safe below remains as a defense-in-depth
assertion — if the load-time validator has a bug and admits a model that runs longer
than expected, the cycle check catches it and disables the model. But the cycle check
is not the primary safety mechanism, because it measures cycles after run_inference
completes; it cannot interrupt an infinite loop.
// isle-core/src/inference/safety.rs
/// Every model invocation goes through this wrapper.
/// The primary termination guarantee comes from load-time structural
/// validation (Section 45.7). This wrapper provides defense-in-depth:
/// post-hoc cycle checking, output validation, and fallback.
pub fn infer_safe<const MAX_NS: u64>(
model: &KernelModel,
input: &[i32],
output: &mut [i32],
fallback: impl FnOnce(&[i32], &mut [i32]),
) {
// 1. Check model is active
if !model.active.load(Ordering::Acquire) {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 2. Run inference with post-hoc cycle measurement.
// Termination is guaranteed by load-time structural validation
// (bounded tree depth / bounded layer count / bounded MAC ops).
// The cycle check is defense-in-depth only.
//
// For TinyNeuralNet models, run_inference() checks need_resched()
// between layers. If a higher-priority task is pending, it yields
// and resumes after rescheduling. This adds ~5-20ns per layer
// boundary but prevents CPU inference from causing scheduling
// latency spikes. DecisionTree and LinearModel complete in <200ns
// and do not need preemption checks.
//
// Because run_inference() may yield to the scheduler (via
// need_resched()), wall-clock TSC cycles include time spent
// sleeping/waiting — not actual model execution time. On a busy
// system, scheduler latency during a yield could push the
// wall-clock measurement far beyond MAX_NS even though the model
// executed correctly within its budget. To avoid false positives,
// we track whether a yield occurred during inference and skip the
// post-hoc cycle check in that case. The load-time structural
// validator (Section 45.7) is the primary termination guarantee;
// the cycle check is defense-in-depth that is only meaningful
// when the model ran without interruption.
let mut yielded = false;
let start = arch::current::cpu::read_cycle_counter();
let result = model.run_inference(input, output, &mut yielded);
let elapsed = arch::current::cpu::read_cycle_counter() - start;
// 3. Defense-in-depth: check execution time was within bounds.
// This should never fire if the load-time validator is correct.
// If it does fire, it means the validator has a bug — disable
// the model and fall back to the heuristic.
// Skip the check if a yield occurred during inference, because
// the elapsed TSC cycles include scheduler/sleep time that is
// not attributable to the model. The load-time validator remains
// the primary safety mechanism regardless.
if !yielded && elapsed > tsc_cycles_from_ns(MAX_NS) {
model.active.store(false, Ordering::Release);
fallback(input, output);
log_warning!("inference model {} exceeded time bound (validator bug?), disabled", model.name);
return;
}
// 4. Sanity-check output (model-specific validation)
if !model.validate_output(output) {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 5. Update statistics
model.stats.total_invocations.fetch_add(1, Ordering::Relaxed);
model.stats.total_latency_ns.fetch_add(
tsc_ns_from_cycles(elapsed), Ordering::Relaxed
);
}
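The call pattern can be illustrated with a simplified userspace analogue of the wrapper — the active check, the model run, output validation, and fallback on any failure. Kernel-specific pieces (cycle counters, need_resched(), the yield flag) are omitted, and all names here (ToyModel, infer_safe_sketch) are illustrative:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

pub struct ToyModel {
    pub active: AtomicBool,
    pub fallback_count: AtomicU64,
}

/// Simplified analogue of infer_safe: check the model is active, run it,
/// validate its output, and invoke the heuristic fallback on any failure.
/// The real wrapper additionally does post-hoc cycle accounting with the
/// yield-aware skip described above.
pub fn infer_safe_sketch(
    model: &ToyModel,
    input: &[i32],
    output: &mut [i32],
    run: impl FnOnce(&[i32], &mut [i32]) -> bool, // returns "output valid?"
    fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    if !model.active.load(Ordering::Acquire) || !run(input, output) {
        // Either the model is disabled or its output failed validation:
        // the heuristic overwrites the output and the fallback is counted.
        fallback(input, output);
        model.fallback_count.fetch_add(1, Ordering::Relaxed);
    }
}
```

The important property is that the fallback closure is always available at the call site, so every caller gets heuristic behavior for free when the model is unavailable.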
Execution time bounding: Static analysis (infer_safe) cannot prove termination
for arbitrary GPU kernels (this is equivalent to the halting problem). Instead, ISLE
uses a hardware watchdog approach:
- Every accelerator dispatch includes a max_execution_us: u64 timeout (default:
5,000,000 μs = 5 seconds for compute kernels, 16,667 μs = 1/60th second for
graphics).
- The driver programs the GPU's hardware watchdog timer (GFX_TIMEOUT on AMD,
TDR on NVIDIA via VGPU, or software timer for accelerators without hardware
watchdog) before submitting the kernel.
- If the kernel exceeds max_execution_us, the GPU engine is reset (per-engine
reset, not full GPU reset where hardware supports it) and the dispatch returns
ETIMEDOUT to the submitter. Other engines and other contexts are not affected.
- infer_safe provides a BEST-EFFORT execution time estimate when possible (e.g.,
bounded loop counts × estimated per-iteration cost), which is used to set a
tighter watchdog timeout. Kernels that infer_safe cannot analyze use the
default timeout.
Hardware-accelerated inference timeout (NPU/DSP offload):
The infer_safe wrapper above covers CPU-side inference only. When a TinyNeuralNet
model is offloaded to a hardware accelerator (e.g., an NPU or DSP via the AccelBase
KABI), the post-hoc cycle counter cannot bound execution: once a command buffer is
submitted to the device, the CPU cannot observe or interrupt mid-execution (for
non-preemptible devices, see Section 42.2.4).
For hardware-offloaded inference, the scheduler enforces a hardware command timeout
using the max_execution_us field in AccelContextLimits (Section 42.2.3). The
inference subsystem creates a dedicated AccelContext for each model with a tight
timeout derived from KernelModel::max_latency_ns:
// isle-core/src/inference/hw_offload.rs (kernel-internal)
/// Submit an inference request to a hardware accelerator with a hard timeout.
/// If the device does not complete within `model.max_latency_ns * HW_TIMEOUT_MULTIPLIER`,
/// the AccelScheduler cancels the submission and `infer_hw` falls back to the CPU path.
///
/// HW_TIMEOUT_MULTIPLIER = 4 — allows for device load variance while still bounding
/// worst-case lock-up. A misbehaving model cannot hold the accelerator indefinitely.
pub fn infer_hw(
model: &KernelModel,
ctx: &AccelContextHandle,
input: &[i32],
output: &mut [i32],
fallback: impl FnOnce(&[i32], &mut [i32]),
) {
// 1. Submit inference command buffer to hardware.
// AccelContextLimits::max_execution_us is set to
// (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000 at context creation.
// The AccelScheduler arms a kernel timer on submission; if the device does not
// complete before the timer fires, it calls preempt_context() or, for
// non-preemptible devices, marks the submission Timeout at the next boundary.
let submit_result = accel_submit_inference(ctx, model, input);
if submit_result.is_err() {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 2. Wait for completion with the same timeout bound.
// poll_completion() returns Timeout if max_execution_us elapsed.
match accel_poll_completion(ctx, model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) {
AccelCompletionStatus::Success => { /* copy results from device buffer to output */ }
AccelCompletionStatus::Timeout | AccelCompletionStatus::Error => {
// Device did not complete in time — fall back to CPU heuristic.
// Disable hardware offload for this model after repeated timeouts.
model.hw_timeout_count.fetch_add(1, Ordering::Relaxed);
if model.hw_timeout_count.load(Ordering::Relaxed) > HW_TIMEOUT_DISABLE_THRESHOLD {
model.hw_offload_enabled.store(false, Ordering::Release);
log_warning!("inference model {} exceeded hw timeout {} times, disabling hw offload",
model.name, HW_TIMEOUT_DISABLE_THRESHOLD);
}
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
}
AccelCompletionStatus::Preempted => {
// Higher-priority work displaced the inference — fall back to CPU.
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
}
}
}
The AccelContext for in-kernel inference is created at model load time with:
- max_execution_us = (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000
- priority = AccelPriority::Background (inference is advisory; never starves user workloads)
- max_memory_bytes = model parameter size + fixed scratch buffer (no dynamic allocation)
This ensures that a misbehaving model or faulty NPU firmware cannot lock the accelerator indefinitely. The kernel timer in the AccelScheduler provides the hard bound; the CPU fallback path ensures the kernel continues operating correctly even if the hardware accelerator is unresponsive.
45.6a Adversarial Robustness
Section 45.6 addresses runtime safety (cycle budgets, fallback, output clamping). This section addresses a different threat: adversarial inputs — workload patterns deliberately crafted to exploit learned kernel models.
Threat model — an unprivileged attacker crafts memory access patterns, I/O sequences, or scheduling behavior designed to:
1. Degrade performance: trick the page prefetcher into evicting hot pages, or trick the I/O scheduler into making sub-optimal ordering decisions
2. Denial of service: cause the model to consistently produce worst-case outputs, degrading system throughput for co-tenants on the same machine
3. Information leakage: infer information about other processes' behavior by observing how the model's decisions change in response to probing inputs
Why kernel models are partially resistant — unlike ML models in adversarial ML research (image classifiers, NLP), kernel models operate on aggregate statistics (page fault rate over the last 1000 faults, I/O queue depth histogram), not on raw inputs from a single source. An attacker controls only their own process's behavior, which is one signal among many feeding into the model.
Mitigations:
- Per-process model state isolation: page prefetch and I/O scheduling models maintain per-process (per-cgroup) input features. Attacker process A's access pattern cannot directly influence model decisions for victim process B. The model sees A's features and B's features independently.
- Output clamping (already in Section 45.6): model output is clamped to a safe range. Even the worst possible model output (prefetch the maximally wrong pages, schedule I/O in the worst possible order) produces bounded degradation — the system falls back to LRU/FIFO behavior, which is the same as having no model at all. The adversary's best attack reduces performance to the no-model baseline, not below it.
- Anomaly detection on model inputs: the infer_safe wrapper tracks input feature distributions. If input features drift significantly from the training distribution (measured by simple range checks and mean/variance tracking), the model is automatically disabled for that process/cgroup and the heuristic fallback activates. This prevents an attacker from driving the model into an untrained region of the input space.
- Model-decision audit tracing: all model decisions are available via stable tracepoints (Section 40). Security-sensitive deployments can monitor for anomalous model behavior (e.g., prefetch hit rate dropping below a threshold for a specific cgroup) and trigger investigation.
- Side-channel hardening: the model's decision is not directly observable by unprivileged processes. An attacker cannot ask "what did the prefetcher decide?" — they can only observe timing (whether a page fault occurred or not). This is equivalent to the existing cache side-channel problem, not a new attack surface. Standard cache partitioning mitigations (CAT, MBA) apply equally to model-influenced cache behavior.
Accepted risk — a sophisticated attacker with co-tenant access can degrade their own performance or (with difficulty) slightly degrade co-tenant prefetch accuracy, but cannot cause worse-than-baseline behavior (output clamping), cannot crash the kernel (cycle watchdog), and cannot corrupt other processes' data (model outputs are advisory — they influence which pages to prefetch, not which pages are accessible). This risk profile is comparable to existing cache pollution attacks, which are accepted in shared environments.
45.6b Fallback Mode Safety Specification
When in-kernel inference encounters errors, timeouts, or hardware failures, the system must transition to a safe fallback mode without compromising system integrity. This section specifies the safety guarantees that MUST be maintained even when fallback is active.
45.6b.1 Fallback Trigger Conditions
Fallback mode is triggered by any of the following conditions:
| Condition | Scope | Automatic Recovery |
|---|---|---|
| Model disabled (active=false) | Per-model | Yes, when admin re-enables |
| Cycle budget exceeded | Per-model | No, requires admin review |
| Hardware offload timeout (NPU/DSP) | Per-model + per-device | Yes, after cooldown period |
| Output validation failure | Per-model, per-invocation | Yes, model auto-disables after threshold |
| Input distribution anomaly | Per-cgroup, per-model | Yes, after distribution normalizes |
| Accelerator device failure | Per-device | No, requires device reset |
| Model binary signature failure | Per-model | No, requires valid signed model |
Critical invariant: Fallback mode NEVER disables safety-critical kernel checks. The inference engine provides advisory decisions only (page prefetch hints, I/O ordering hints). The following safety guarantees remain active unconditionally:
- Memory access validation (page tables, permission checks)
- Capability checks for all resource accesses
- I/O command validation (DMA bounds, register ranges)
- Interrupt masking and priority enforcement
- Scheduler fairness and preemption guarantees
- Watchdog timers for all hardware operations
45.6b.2 Fallback Scope and Isolation
Fallback is NEVER system-wide. The isolation granularity is:
- Per-model: Each inference model (page_prefetch, io_scheduler, etc.) operates independently. A failure in one model does not affect others.
- Per-cgroup for input anomaly: If a specific cgroup's input features drift outside the training distribution, fallback activates for that cgroup only. Other cgroups continue using the model normally.
- Per-device for hardware offload: If NPU A times out, models offloaded to NPU A fall back to CPU inference. NPU B continues operating normally.
// isle-core/src/inference/fallback.rs (kernel-internal)
/// Fallback state for a single model instance.
/// This is tracked independently per (model, cgroup) pair.
pub struct FallbackState {
/// Why fallback is active (None = not in fallback)
pub reason: Option<FallbackReason>,
/// Timestamp when fallback mode was entered (monotonic clock)
pub entered_at: Option<u64>,
/// Number of consecutive fallback invocations
pub consecutive_count: u64,
/// Maximum consecutive fallbacks before escalation
pub escalation_threshold: u64,
}
/// Reasons for entering fallback mode
pub enum FallbackReason {
/// Model explicitly disabled by admin
ModelDisabled,
/// Cycle budget exceeded (validator bug suspected)
CycleBudgetExceeded,
/// Hardware accelerator timeout
HwTimeout { device_id: u32, timeout_us: u64 },
/// Output failed model-specific validation
OutputValidationFailed,
/// Input features outside training distribution
InputDistributionAnomaly {
feature_index: u32,
observed_value: i32,
expected_range: (i32, i32),
},
/// Accelerator device reported error
DeviceError { device_id: u32, error_code: u32 },
}
45.6b.3 Fallback Duration and Escalation
Fallback mode has bounded duration with automatic escalation:
| Duration | Action |
|---|---|
| 0-60 seconds | Automatic recovery: continue using heuristic fallback, periodically retry model |
| 60-300 seconds | Escalation: emit kernel warning event, log to audit trail |
| 300+ seconds | Critical escalation: emit high-priority alert to system management daemon |
| 3600+ seconds (configurable) | Model auto-disables: set active=false, require admin to re-enable |
/// Fallback duration thresholds (configurable via sysfs)
pub const FALLBACK_RETRY_INTERVAL_SECS: u64 = 10; // Retry model every 10s
pub const FALLBACK_WARNING_SECS: u64 = 60; // Log warning after 60s
pub const FALLBACK_ESCALATION_SECS: u64 = 300; // Alert after 5 minutes
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600; // Disable model after 1 hour
Recovery attempts: While in fallback mode, the kernel periodically attempts to
use the model again (every FALLBACK_RETRY_INTERVAL_SECS). If the model succeeds
three consecutive times, fallback mode exits and normal operation resumes. This
handles transient hardware glitches without operator intervention.
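The retry-then-escalate logic above amounts to a small state machine. A sketch, reusing the duration constants from this section — the enum name, function name, and the per-second tick cadence are illustrative, not the kernel's actual API:

```rust
/// Fallback duration thresholds, matching the constants defined above.
pub const FALLBACK_RETRY_INTERVAL_SECS: u64 = 10;
pub const FALLBACK_WARNING_SECS: u64 = 60;
pub const FALLBACK_ESCALATION_SECS: u64 = 300;
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600;

#[derive(Debug, PartialEq)]
pub enum FallbackAction {
    Recovered,   // three consecutive successful retries: exit fallback
    Retry,       // attempt the model again
    Warn,        // log warning at the 60s boundary
    Escalate,    // emit high-priority event at the 300s boundary
    AutoDisable, // set active=false, require admin reset
    Wait,        // stay in fallback, nothing to do this tick
}

/// Decide the next action `elapsed_secs` after entering fallback.
/// In this sketch the function is polled once per second by a timer.
pub fn fallback_tick(elapsed_secs: u64, consecutive_successes: u32) -> FallbackAction {
    if consecutive_successes >= 3 {
        return FallbackAction::Recovered;
    }
    if elapsed_secs >= FALLBACK_AUTO_DISABLE_SECS {
        FallbackAction::AutoDisable
    } else if elapsed_secs == FALLBACK_ESCALATION_SECS {
        FallbackAction::Escalate
    } else if elapsed_secs == FALLBACK_WARNING_SECS {
        FallbackAction::Warn
    } else if elapsed_secs > 0 && elapsed_secs % FALLBACK_RETRY_INTERVAL_SECS == 0 {
        FallbackAction::Retry
    } else {
        FallbackAction::Wait
    }
}
```

Because the decision is a pure function of elapsed time and recent retry results, the escalation behavior is fully deterministic and easy to audit against the table above.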
Escalation path:
Fallback entered
↓
[Retry every 10s, up to 60s]
↓ (still failing)
Log warning: "Model X in fallback for 60s, reason=Y"
↓
[Continue retries, up to 300s]
↓ (still failing)
Emit event: isle_inference_fallback(model="X", duration_s=300, reason="Y")
↓
[Continue retries, up to 3600s]
↓ (still failing)
Auto-disable model, emit critical event
↓
Require admin to re-enable via sysfs write
45.6b.4 Admin Intervention Requirements
Certain fallback conditions require explicit admin action to exit fallback mode:
Requires admin intervention:
- Model disabled due to cycle budget exceeded (indicates validator bug)
- Accelerator device failure (requires device reset or replacement)
- Model binary signature verification failure

Automatic recovery allowed:
- Hardware timeout (transient NPU load)
- Input distribution anomaly (workload shifted temporarily)
- Output validation failure (if under threshold)
The sysfs interface exposes recovery control:
/sys/kernel/isle/inference/models/page_prefetch/
fallback_reason # Read: current fallback reason, empty if not in fallback
fallback_duration_ms # Read: milliseconds since fallback entered
fallback_auto_recover # Write: "1" to attempt immediate recovery, "0" to stay in fallback
require_admin_reset # Read: "1" if admin action required to exit fallback
To exit a fallback state requiring admin intervention:
# Check current state
cat /sys/kernel/isle/inference/models/page_prefetch/fallback_reason
# Output: "cycle_budget_exceeded"
# This condition requires admin review
cat /sys/kernel/isle/inference/models/page_prefetch/require_admin_reset
# Output: "1"
# After reviewing logs and validating the model is correct, admin resets:
echo 1 > /sys/kernel/isle/inference/models/page_prefetch/active
45.6b.5 Audit Logging
All fallback state transitions are logged to the kernel audit subsystem:
/// Audit event emitted on fallback state changes
pub struct FallbackAuditEvent {
/// Timestamp (monotonic nanoseconds since boot)
pub timestamp_ns: u64,
/// Model identifier (e.g., "page_prefetch")
pub model_name: &'static str,
/// Cgroup ID (0 if fallback is model-wide)
pub cgroup_id: u64,
/// Device ID (0 if not device-related)
pub device_id: u32,
/// Event type
pub event_type: FallbackEventType,
/// Reason for the event
pub reason: FallbackReason,
}
pub enum FallbackEventType {
/// Entered fallback mode
Entered,
/// Automatic recovery attempt (retrying model)
RecoveryAttempt,
/// Successfully recovered, exiting fallback
Recovered,
/// Escalation: warning logged
EscalationWarning,
/// Escalation: critical alert emitted
EscalationCritical,
/// Model auto-disabled due to prolonged fallback
AutoDisabled,
/// Admin re-enabled model after manual review
AdminReset,
}
Audit log entries (visible via tracefs and forwarded to system audit daemon):
[12345.678901] ISLE-INFERENCE: model=page_prefetch event=entered reason=hw_timeout device=4 timeout_us=2000
[12355.678901] ISLE-INFERENCE: model=page_prefetch event=recovery_attempt
[12355.679001] ISLE-INFERENCE: model=page_prefetch event=recovered
[12405.678901] ISLE-INFERENCE: model=io_scheduler event=entered reason=input_anomaly cgroup=1234 feature=2
[12465.678901] ISLE-INFERENCE: model=io_scheduler event=escalation_warning duration_s=60
Audit retention: Fallback audit events are retained for a minimum of 30 days (configurable) to support post-incident analysis. Security-sensitive deployments MAY configure longer retention or external log forwarding.
45.6b.6 Fallback Heuristic Specification
When fallback is active, the system MUST use a well-defined heuristic that provides correct (if suboptimal) behavior:
| Model | Fallback Heuristic |
|---|---|
| Page prefetch | Sequential readahead: next 4 pages from current fault address |
| I/O scheduling | mq-deadline: order by LBA, 50ms deadline for reads, 500ms for writes |
| FMA anomaly | Static thresholds: trigger alert if any single metric exceeds hard limit |
| Memory migration | Access-count based: migrate if device_access_count > cpu_access_count × 2 |
| Power budget | Equal distribution: allocate total_power / num_domains to each domain |
Critical safety property: Fallback heuristics MUST NOT bypass any safety checks. For example, the sequential readahead fallback still validates that:
- The target pages fall within the process's virtual address space
- The underlying physical pages are accessible (not protected, not corrupt)
- The readahead does not exceed the process's memory limits
The fallback heuristic is simply a decision algorithm for which pages to prefetch or which I/O to prioritize — it does NOT bypass the memory management or I/O subsystem's safety validation layers.
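The page-prefetch row of the table above (sequential readahead of the next 4 pages) can be sketched as follows. The clamp against the region end stands in for the full VMA/permission validation done by the memory subsystem, and all names are illustrative:

```rust
const PAGE_SHIFT: u32 = 12; // 4 KiB pages
const READAHEAD_PAGES: u64 = 4;

/// Fallback heuristic: propose the next 4 page numbers after the
/// faulting address, clamped to the end of the memory region. The
/// result is advisory only -- the real MM path still validates each
/// page against the VMA, permissions, and memory limits.
pub fn sequential_readahead(fault_addr: u64, region_end: u64) -> Vec<u64> {
    let fault_page = fault_addr >> PAGE_SHIFT;
    let end_page = region_end >> PAGE_SHIFT; // exclusive bound
    (fault_page + 1..fault_page + 1 + READAHEAD_PAGES)
        .filter(|&p| p < end_page)
        .collect()
}
```

The heuristic is deliberately trivial: its value is predictability, so that "fallback active" always means the same well-understood behavior.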
45.7 Model Binary Format
In-kernel models use a simple binary format for loading:
Header (4790 bytes — larger than original 256 bytes due to ML-DSA signature):
magic: u32 = 0x49534C45 ("ISLE")
version: u32 = 1
model_type: u32 (KernelModelType discriminant)
input_features: u32
outputs: u32
param_size: u64 (bytes of parameter data following header)
max_latency_ns: u64 (pre-computed worst-case inference time)
sha256_hash: [u8; 32] (SHA-256 over entire model file: header fields above
[magic through max_latency_ns] concatenated with parameter
data — integrity check covering both structure and weights)
ed25519_sig: [u8; 64] (Ed25519 signature over entire model — authentication)
mldsa_sig_len: u16 (actual ML-DSA signature length; ML-DSA-65 = 3309, ML-DSA-87 = 4627)
mldsa_sig: [u8; 4627] (ML-DSA signature over entire model — post-quantum authentication;
sized for largest variant ML-DSA-87; actual length in mldsa_sig_len)
_reserved: [u8; 29] (must be zero)
Parameter data: [u8; param_size]
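The layout above can be expressed as a packed struct for parsing; the field order and sizes follow the header description exactly, and the 4790-byte total is recoverable from the struct itself. The struct name, basic_valid helper, and the copy-out-of-packed-fields idiom are illustrative:

```rust
/// Illustrative in-memory view of the model header described above.
/// repr(C, packed) reproduces the on-disk layout with no padding.
#[repr(C, packed)]
pub struct ModelHeader {
    pub magic: u32,            // 0x49534C45 ("ISLE")
    pub version: u32,          // 1
    pub model_type: u32,       // KernelModelType discriminant
    pub input_features: u32,
    pub outputs: u32,
    pub param_size: u64,       // bytes of parameter data after the header
    pub max_latency_ns: u64,   // pre-computed worst-case inference time
    pub sha256_hash: [u8; 32],
    pub ed25519_sig: [u8; 64],
    pub mldsa_sig_len: u16,    // 3309 (ML-DSA-65) or 4627 (ML-DSA-87)
    pub mldsa_sig: [u8; 4627], // sized for the largest variant
    pub _reserved: [u8; 29],   // must be zero
}

pub const MODEL_MAGIC: u32 = 0x49534C45;

impl ModelHeader {
    /// First-stage load check: magic bytes and reserved bytes only.
    /// (Hash, signature, and structural checks follow in later stages.)
    pub fn basic_valid(&self) -> bool {
        let reserved = self._reserved; // copy out: no refs into packed fields
        self.magic == MODEL_MAGIC && reserved.iter().all(|&b| b == 0)
    }
}
```

Summing the fields (20 + 16 + 32 + 64 + 2 + 4627 + 29) reproduces the 4790-byte header size stated above.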
Model binaries are verified using the same hybrid signature scheme as kernel modules
(Section 22). The SHA-256 hash covers the entire model file (header + parameters),
providing integrity over both the model structure and weights — an attacker cannot
modify the header to reinterpret parameter data without invalidating the hash. The
Ed25519 + ML-DSA signatures over the entire model provide authentication. Both
signatures must verify against the kernel's
built-in model signing public keys. Unsigned or incorrectly signed models are rejected
unless the system is booted with accel.allow_unsigned=1 (disabled by default, requires
CAP_ACCEL_ADMIN). This prevents an attacker from loading arbitrary model binaries
even if they can construct one with a matching SHA-256 hash.
Load-time validation:
- Verify magic bytes (0x49534C45).
- Check param_size against maximum allowed model size (configurable, default 1 MB).
- Verify SHA-256 hash of entire model file (header fields magic through
  max_latency_ns concatenated with parameter data) matches sha256_hash.
- Signature verification — verify both Ed25519 and ML-DSA signatures over the
  entire model file (header + parameters) against the kernel's built-in model
  signing public keys. The ML-DSA signature uses the length indicated by
  mldsa_sig_len. Reject if either signature fails (unless accel.allow_unsigned=1).
- Structural termination proof — the validator statically proves bounded execution
  based on model type:
  - Decision trees: Verify tree depth <= max_depth and that the tree is acyclic
    (DAG check via topological sort). Reject if any cycle is found or depth
    exceeds the configured maximum.
  - Linear models: Verify input/output dimensions match the declared
    input_features and outputs fields. A single matrix-vector multiply is
    inherently bounded.
  - Neural networks: Verify the layer count is fixed (no recurrent connections),
    and that total multiply-accumulate operations <= max_ops (computed from layer
    dimensions). Reject any model with recurrent or self-referencing layers.

  Any model that cannot be statically proven to terminate in bounded time is
  rejected. This is the primary safety mechanism — see Section 45.6.
- Compute worst-case latency from the proven model structure (tree depth, layer
  count, MAC operations) and reject if it exceeds the target subsystem's latency
  budget.
Atomic replacement: Models are replaced via double-buffer. The new model is loaded into a shadow slot while the current model continues serving. Once the new model is fully loaded and validated, an atomic pointer swap activates it. The old model is freed after a grace period (RCU-style) to ensure no in-flight inference uses stale data.
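A userspace sketch of the double-buffer swap: a single atomic pointer holds the live model, and replacement publishes the new model with one swap. In the kernel the returned old model would be held until an RCU grace period expires rather than being handed straight back to the caller; the ModelSlot name and Vec stand-in for a model are illustrative:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// Holds the currently active model behind an atomic pointer.
pub struct ModelSlot {
    current: AtomicPtr<Vec<u8>>, // Vec<u8> stands in for a validated model
}

impl ModelSlot {
    pub fn new(model: Vec<u8>) -> Self {
        Self { current: AtomicPtr::new(Box::into_raw(Box::new(model))) }
    }

    /// Atomically publish a fully loaded and validated replacement.
    /// Returns the old model; the caller must keep it alive until a
    /// grace period guarantees no in-flight inference still reads it
    /// (the kernel uses RCU for this -- here the caller simply drops
    /// it once it is safe).
    pub fn swap(&self, new_model: Vec<u8>) -> Box<Vec<u8>> {
        let new_ptr = Box::into_raw(Box::new(new_model));
        let old = self.current.swap(new_ptr, Ordering::AcqRel);
        unsafe { Box::from_raw(old) }
    }

    /// Readers always see either the old model or the new one in full,
    /// never a partially written model.
    pub fn read(&self) -> &Vec<u8> {
        unsafe { &*self.current.load(Ordering::Acquire) }
    }
}

impl Drop for ModelSlot {
    fn drop(&mut self) {
        // Reclaim whichever model is current when the slot is torn down.
        unsafe { drop(Box::from_raw(*self.current.get_mut())) }
    }
}
```

The key property is that load and validation happen entirely in the shadow slot; the only shared-state operation is the single pointer swap.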
45.8 Model Drift Detection and Retraining Pipeline
Models trained on one workload profile may degrade when hardware changes (new storage devices with different latency characteristics), workload shifts (database server repurposed as a build server), or system configuration changes (memory added, CPUs hotplugged). This section addresses how ISLE detects and responds to model drift.
Online accuracy tracking — every model maintains a correct_predictions counter
(Section 45.4 ModelStats). For the page prefetcher, a "correct prediction" means
the prefetched page was actually accessed within a configurable window (~100ms). For the
I/O scheduler, it means the predicted optimal ordering actually reduced tail latency.
The kernel continuously computes a rolling accuracy rate:
accuracy = correct_predictions / total_invocations (over last 60 seconds)
Drift detection thresholds — the model is automatically disabled (falling back to heuristic) when accuracy drops below a configurable threshold:
/sys/kernel/isle/inference/models/page_prefetch/
accuracy # Current rolling accuracy (read-only)
accuracy_threshold # Minimum acceptable accuracy (read/write, default: 0.60)
drift_detected # "1" if accuracy < threshold (read-only)
auto_disable # "1" to auto-disable on drift (read/write, default: "1")
When `drift_detected` transitions to 1:
1. Model is disabled, heuristic fallback activates immediately
2. A kernel event is emitted: isle_inference_drift(model="page_prefetch", accuracy=0.52)
3. The event is visible via stable tracepoints (Section 40) and sysfs
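A minimal sketch of the detector, assuming integer counters shaped like the `ModelStats` fields of Section 45.4. Accuracy is computed in parts per thousand so the comparison stays integer-only, as in-kernel code requires (0.60 becomes 600 permille):

```rust
// Hypothetical mirror of the relevant ModelStats counters.
struct ModelStats {
    correct_predictions: u64,
    total_invocations: u64,
}

/// Rolling accuracy in parts-per-thousand (0..=1000), integer-only.
fn accuracy_permille(s: &ModelStats) -> u64 {
    if s.total_invocations == 0 {
        return 1000; // no evidence yet: do not disable the model
    }
    s.correct_predictions * 1000 / s.total_invocations
}

/// True when the model should be disabled and the heuristic fallback
/// activated (mirrors the drift_detected sysfs attribute).
fn drift_detected(s: &ModelStats, threshold_permille: u64) -> bool {
    accuracy_permille(s) < threshold_permille
}
```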
Retraining pipeline — model retraining is deliberately out-of-kernel. The kernel collects training data; userspace trains models; the kernel loads the result:
┌─────────────┐ trace data ┌──────────────┐ model.bin ┌─────────────┐
│ ISLE Kernel │ ─────────────→ │ Userspace │ ───────────→ │ ISLE Kernel │
│ (inference) │ sysfs/tracefs │ Trainer │ sysfs write │ (inference) │
│ │ │ (isle-mltool)│ │ │
│ Emits: │ │ - Reads trace │ │ Atomic swap: │
│ - features │ │ - Trains model│ │ new model │
│ - outcomes │ │ - Quantizes │ │ replaces old │
│ - accuracy │ │ - Validates │ │ │
└─────────────┘ └──────────────┘ └─────────────┘
The isle-mltool userspace utility (shipped with ISLE) automates the pipeline:
1. Reads training features from tracefs ring buffer
2. Trains a decision tree or lookup table using the collected features + outcomes
3. Quantizes to int8/int16
4. Validates against a holdout set
5. Writes the new model to sysfs (atomic replacement)
This can run as a cron job, a systemd timer, or triggered by the drift detection event. The kernel never trains a model itself — training requires floating-point, unbounded memory, and access to training libraries, all of which belong in userspace.
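Step 3 of the pipeline can be illustrated with a symmetric int8 quantizer: the scale maps the largest-magnitude float weight to ±127, and the kernel later folds that scale into a fixed-point multiplier for integer-only inference. The function name is illustrative, not the actual isle-mltool API:

```rust
/// Symmetric int8 quantization: returns (quantized weights, scale), where
/// the dequantized value is `q as f32 * scale`.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        // All-zero weights: any scale works; pick 1.0.
        return (vec![0; weights.len()], 1.0);
    }
    let scale = max_abs / 127.0;
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}
```

Validation against a holdout set (step 4) then runs the *quantized* model, since that — not the float original — is what the kernel will execute.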
Why not online learning in-kernel? — Online learning (updating model weights incrementally as new data arrives) would eliminate the retraining round-trip. However:
- Online learning requires floating-point arithmetic (kernel uses integer-only inference)
- Convergence guarantees are weaker (adversarial inputs could steer the model — see Section 45.6a)
- Deterministic execution is impossible to guarantee with continuous weight updates
- The kernel's cycle budget (~500-5000 cycles per inference) leaves no room for gradient computation
The offline-train / online-infer split is deliberate: the kernel is a fast, dumb inference engine; intelligence lives in userspace where it can be debugged, validated, and rolled back.
45.9 Tier 2 Inference Services
Section 45 defines in-kernel inference: tiny integer-only models running in the kernel's cycle budget (~500-5000 cycles) for per-decision hot paths. But many kernel optimization decisions don't need per-I/O latency — they're slow-path, strategic decisions made every few seconds or minutes. These can benefit from much more powerful models running in Tier 2 userspace drivers.
The architectural fit — Tier 2 drivers (Section 3) run as isolated userspace processes communicating with isle-core via ring buffers. They can:
- Use floating-point and SIMD (no kernel FP restrictions)
- Allocate unlimited heap memory
- Link against full ML frameworks (ONNX Runtime, XGBoost, scikit-learn, PyTorch)
- Access GPU/NPU accelerators via isle-accel (Section 42)
- Crash without affecting the kernel (full process isolation)
This makes Tier 2 the natural home for advanced AI capabilities that exceed what the in-kernel inference engine can provide.
Tier 2 inference service model:
┌─────────────────────────────────────────────────────────┐
│ ISLE Core (Ring 0) │
│ │
│ In-kernel inference (45): Tier 2 IPC client: │
│ - Decision trees (fast) - Ring buffer query → │
│ - Lookup tables (fast) - ← Ring buffer reply │
│ - ~500-5000 cycles - ~50-200 μs round-trip│
│ - Hot path (per-I/O) - Warm path (periodic) │
│ │
└──────────────────────┬──────────────────────────────────┘
│ Ring buffer IPC
│ (shared memory, zero-copy)
┌──────────────────────▼──────────────────────────────────┐
│ Tier 2 Inference Service (Userspace process) │
│ │
│ - Full FP, GPU access, unlimited memory │
│ - GBM / Random Forest / small transformer models │
│ - Online learning (incremental weight updates) │
│ - Training data ingest from kernel ring buffer │
│ - Model export back to in-kernel engine │
│ │
└─────────────────────────────────────────────────────────┘
Use cases for Tier 2 inference services:
| Decision | Frequency | In-kernel (45) | Tier 2 service |
|---|---|---|---|
| Page prefetch (which pages next?) | Per fault (~μs) | Decision tree | N/A — too latency-sensitive |
| I/O scheduling (reorder queue) | Per I/O (~μs) | Lookup table | N/A — too latency-sensitive |
| NUMA rebalancing (move pages?) | Every ~10s | Too complex | GBM model: predict migration benefit from memory access histograms |
| Compression algorithm selection | Per cgroup, every ~30s | N/A | Random forest: select LZ4/Zstd/none based on cgroup's page entropy distribution |
| Swap tier selection (Section 47) | Every ~5s | N/A | Regression model: predict optimal local-vs-remote swap ratio from RDMA latency measurements |
| Anomaly detection (FMA, Section 39) | Every ~1s | N/A | Autoencoder or isolation forest: detect anomalous device behavior patterns |
| Power budget optimization (Section 49) | Every ~1s | N/A | RL agent: learn optimal RAPL power limits given current workload mix |
| Driver crash prediction | Every ~5s | N/A | Time-series model: predict imminent driver failure from error rate trends |
Online learning — safely in Tier 2:
The in-kernel inference engine cannot do online learning (Section 45.8 explains why: no floating-point, no convergence guarantees, adversarial risk). But a Tier 2 service can, because:
- Full FP available: Tier 2 runs in userspace with standard `libm`, SSE/AVX, and GPU access
- Crash-safe: if the online learning algorithm diverges or crashes, the Tier 2 process restarts. The kernel falls back to its in-kernel heuristic. No kernel impact.
- Adversarial resistance: the Tier 2 service can implement proper input validation, outlier detection, and gradient clipping — techniques too expensive for in-kernel but standard practice in ML engineering
- Auditable: the Tier 2 service's model weights, training data, and decisions can be logged, inspected, and rolled back using standard userspace debugging tools
The online learning loop:
1. Kernel emits training features to Tier 2 service via ring buffer
(page fault patterns, I/O latencies, scheduling decisions + outcomes)
2. Tier 2 service updates model incrementally (mini-batch SGD, online RF, etc.)
3. Periodically, Tier 2 service quantizes a snapshot of its model to int8/int16
4. Tier 2 service writes the quantized model to the kernel via sysfs
(atomic model replacement — Section 45.4)
5. In-kernel inference engine picks up the new model instantly
6. Kernel continues fast, deterministic, integer-only inference with fresh weights
This gives the system continuous adaptation — the in-kernel model is refreshed every few minutes with weights learned from the actual running workload — without any of the risks of in-kernel online learning. The Tier 2 service absorbs all the complexity (floating-point, convergence, validation), and the kernel sees only a stream of pre-validated, quantized model snapshots.
Tier 2 inference service KABI:
/// KABI interface for Tier 2 inference services.
/// Registered via the standard Tier 2 driver mechanism.
#[repr(C)]
pub struct InferenceServiceVTable {
pub vtable_size: u64,
pub version: u32,
/// Kernel sends a query to the inference service.
/// Returns immediately — result arrives asynchronously via ring buffer.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer and
/// `features` points to at least `feature_count` valid i32 values.
pub submit_query: unsafe extern "C" fn(
service: *mut ServiceContext,
query_type: InferenceQueryType,
features: *const i32,
feature_count: u32,
) -> KabiResult,
/// Kernel retrieves the latest model snapshot (int8/int16 quantized).
/// Called periodically to update the in-kernel inference engine.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer and
/// `buffer` points to at least `buffer_len` bytes of writable memory.
pub get_model_snapshot: unsafe extern "C" fn(
service: *mut ServiceContext,
model_type: KernelModelType,
buffer: *mut u8,
buffer_len: usize,
) -> KabiResult,
/// Kernel reports ground truth (outcome of a previous prediction).
/// Enables online accuracy tracking in the Tier 2 service.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer.
pub report_outcome: unsafe extern "C" fn(
service: *mut ServiceContext,
query_id: u64,
outcome: i32,
) -> KabiResult,
}
#[repr(u32)]
pub enum InferenceQueryType {
/// Should we rebalance NUMA memory for this cgroup?
NumaRebalance = 0,
/// Which compression algorithm for this cgroup?
CompressionSelect = 1,
/// Optimal swap tier ratio (local vs remote)?
SwapTierSelect = 2,
/// Anomaly score for this device's health metrics?
DeviceAnomaly = 3,
/// Optimal power budget for current workload mix?
PowerBudget = 4,
}
Latency budget — Tier 2 IPC round-trip is ~50-200μs (ring buffer submission + context switch + userspace inference + ring buffer reply). This is acceptable for decisions made every 1-30 seconds. For any decision needed per-I/O or per-fault, the in-kernel inference engine (Section 45) remains the only option.
Fallback hierarchy:
Tier 2 inference service available?
├── Yes → Use Tier 2 for slow-path decisions, in-kernel for hot-path
│ Tier 2 periodically refreshes in-kernel model weights
│
└── No → In-kernel inference engine with static model
├── Model loaded? → Use model
└── No model? → Heuristic fallback (LRU, FIFO, static thresholds)
The system is fully functional at every level of the fallback hierarchy. Tier 2 inference services are a performance optimization, not a requirement. A system with no ML models at all behaves identically to a traditional kernel using standard heuristics.
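The fallback hierarchy reduces to a pure decision function. The enum and names below are illustrative; the real dispatch lives inside isle-core:

```rust
#[derive(Debug, PartialEq)]
enum InferencePath {
    /// Slow-path decisions via Tier 2, hot-path in-kernel; Tier 2
    /// periodically refreshes the in-kernel model weights.
    Tier2PlusInKernel,
    /// In-kernel inference engine with a static model only.
    InKernelModel,
    /// Heuristic fallback: LRU, FIFO, static thresholds.
    Heuristic,
}

fn choose_path(tier2_available: bool, model_loaded: bool) -> InferencePath {
    match (tier2_available, model_loaded) {
        (true, _) => InferencePath::Tier2PlusInKernel,
        (false, true) => InferencePath::InKernelModel,
        (false, false) => InferencePath::Heuristic,
    }
}
```

Every arm yields a working system; the arms differ only in how much optimization intelligence is active.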
Shipped Tier 2 services — ISLE ships the following reference Tier 2 inference services (optional, loaded on demand):
| Service | Model type | Purpose |
|---|---|---|
| `isle-ml-numa` | Gradient-boosted trees | NUMA page placement and migration decisions |
| `isle-ml-compress` | Random forest | Per-cgroup compression algorithm selection |
| `isle-ml-anomaly` | Isolation forest | FMA device anomaly detection |
| `isle-ml-power` | Online RL (contextual bandit) | RAPL power budget optimization |
Each is a standalone Tier 2 driver (~500-2000 LOC) that implements the
`InferenceServiceVTable`. They are independently upgradable, independently crashable
(Tier 2 process restart in ~10ms), and independently optional.
46. Accelerator Networking, RDMA, and Linux GPU Compatibility
46.1 RDMA and Collective Operations
46.1.1 Problem
Distributed ML training on N GPUs across M machines requires:
- All-Reduce: After computing gradients on each GPU, average them across all GPUs. This is the single most important collective operation in distributed training.
- All-Gather: Assemble distributed tensor shards.
- Reduce-Scatter: Reduce and distribute results.
- Point-to-point: Direct GPU-to-GPU transfer across network.
These operations need RDMA (Remote Direct Memory Access) for low latency and high
throughput. Linux provides RDMA through the rdma-core / libibverbs userspace stack,
but the kernel's role is minimal (IB/verbs kernel interface is thin).
46.1.2 Design: Kernel-Assisted Collectives
The kernel can optimize collectives by understanding the topology:
8-GPU training cluster:
Machine A: GPU 0, GPU 1 (NVLink connected)
Machine B: GPU 2, GPU 3 (NVLink connected)
Machine C: GPU 4, GPU 5 (NVLink connected)
Machine D: GPU 6, GPU 7 (NVLink connected)
All machines connected via 100GbE RDMA
Optimal all-reduce strategy:
1. Intra-machine: GPU 0 ↔ GPU 1 via NVLink (200 GB/s)
2. Inter-machine: GPU 0 ↔ GPU 2 ↔ GPU 4 ↔ GPU 6 via RDMA (12.5 GB/s)
3. Intra-machine: Broadcast result GPU 0 → GPU 1, etc.
The kernel knows the topology (device registry + network fabric).
Userspace NCCL/RCCL libraries currently discover this themselves.
The kernel can provide it as a service.
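A back-of-envelope cost model shows why the hierarchical strategy wins with the bandwidth numbers from the example (NVLink 200 GB/s, RDMA 12.5 GB/s). This sketch counts only bandwidth terms for one gradient buffer; latency, pipelining, and reduce-scatter details are deliberately omitted:

```rust
/// Flat ring all-reduce: each GPU moves 2*(N-1)/N of the data, and with a
/// naive flat ring every hop may cross the slow inter-machine link.
fn flat_ring_secs(bytes: f64, n_gpus: f64, slow_bw: f64) -> f64 {
    2.0 * (n_gpus - 1.0) / n_gpus * bytes / slow_bw
}

/// Hierarchical all-reduce: (1) intra-machine reduce over NVLink,
/// (2) ring among one leader per machine over RDMA,
/// (3) intra-machine broadcast over NVLink.
fn hierarchical_secs(bytes: f64, leaders: f64, fast_bw: f64, slow_bw: f64) -> f64 {
    let intra = 2.0 * bytes / fast_bw; // reduce + broadcast on the fast link
    let inter = 2.0 * (leaders - 1.0) / leaders * bytes / slow_bw;
    intra + inter
}
```

For a 1 GB gradient buffer on the 8-GPU / 4-machine cluster above, the flat ring costs about 0.14 s while the hierarchical plan costs about 0.13 s, and the gap widens as intra-machine fanout grows — which is exactly the information a kernel-provided topology service lets collective libraries exploit without rediscovery.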
46.1.3 RDMA KABI Interface
/// RDMA device vtable (extends AccelBase for RDMA-capable NICs).
#[repr(C)]
pub struct RdmaDeviceVTable {
pub vtable_size: u64,
pub version: u32,
/// Create a protection domain (PD).
pub create_pd: unsafe extern "C" fn(
ctx: *mut c_void,
out_pd: *mut RdmaPdHandle,
) -> IoResultCode,
/// Register a memory region for RDMA.
/// The registered region can be targeted by remote DMA.
pub register_mr: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
addr: u64, // Virtual or physical address
size: u64,
access: RdmaAccessFlags,
out_mr: *mut RdmaMrHandle,
out_rkey: *mut u32,
) -> IoResultCode,
/// Register device memory (GPU VRAM) for RDMA.
/// Enables GPUDirect RDMA: remote machine can DMA directly to/from GPU.
pub register_device_mr: Option<unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
accel_mem: AccelMemHandle,
offset: u64,
size: u64,
access: RdmaAccessFlags,
out_mr: *mut RdmaMrHandle,
out_rkey: *mut u32,
) -> IoResultCode>,
/// Post a send work request.
pub post_send: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
wr: *const RdmaSendWr,
) -> IoResultCode,
/// Post a receive work request.
pub post_recv: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
wr: *const RdmaRecvWr,
) -> IoResultCode,
/// Poll completion queue.
pub poll_cq: unsafe extern "C" fn(
ctx: *mut c_void,
cq: RdmaCqHandle,
max_entries: u32,
out_wc: *mut RdmaWorkCompletion,
out_count: *mut u32,
) -> IoResultCode,
/// Create a queue pair (QP) for RDMA communication.
pub create_qp: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
qp_type: RdmaQpType,
send_cq: RdmaCqHandle,
recv_cq: RdmaCqHandle,
out_qp: *mut RdmaQpHandle,
) -> IoResultCode,
/// Destroy a queue pair.
pub destroy_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
) -> IoResultCode,
/// Create a completion queue (CQ).
pub create_cq: unsafe extern "C" fn(
ctx: *mut c_void,
min_entries: u32,
out_cq: *mut RdmaCqHandle,
) -> IoResultCode,
/// Destroy a completion queue.
pub destroy_cq: unsafe extern "C" fn(
ctx: *mut c_void,
cq: RdmaCqHandle,
) -> IoResultCode,
/// Deregister a memory region.
pub deregister_mr: unsafe extern "C" fn(
ctx: *mut c_void,
mr: RdmaMrHandle,
) -> IoResultCode,
/// Destroy a protection domain.
pub destroy_pd: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
) -> IoResultCode,
// === Connection Management (RC queue pairs) ===
/// Resolve a route to a remote RDMA destination and initiate connection.
/// Transitions the QP from INIT → RTR → RTS (for RC transport).
/// `dest` contains the remote node's LID/GID, QP number, and path info.
pub connect_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
dest: *const RdmaConnParams,
) -> IoResultCode,
/// Bind a QP to a local port and listen for incoming RC connections.
/// The QP must be in the INIT state. When a remote peer calls
/// `connect_qp` targeting this endpoint, the kernel invokes the
/// `conn_request_callback` registered via `set_conn_callback`.
pub listen_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
port: u8,
) -> IoResultCode,
/// Accept an incoming RC connection request on a listening QP.
/// Transitions the QP to RTS. `request_id` is the identifier
/// provided by the `conn_request_callback`.
pub accept_conn: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
request_id: u64,
resp_params: *const RdmaConnParams,
) -> IoResultCode,
/// Gracefully disconnect an RC queue pair. Sends a DREQ to the
/// remote peer and transitions the QP to the SQD (Send Queue Drained)
/// and then ERROR state. Outstanding work requests are flushed with
/// error completions. After disconnect, the QP can be destroyed.
pub disconnect_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
) -> IoResultCode,
/// Register a callback for incoming connection requests on a listening QP.
/// The callback receives a `request_id` that can be passed to `accept_conn`
/// or ignored (to reject the connection).
pub set_conn_callback: Option<unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
callback: unsafe extern "C" fn(
qp: RdmaQpHandle,
request_id: u64,
remote_params: *const RdmaConnParams,
),
) -> IoResultCode>,
}
/// Connection parameters for RC queue pair connection management.
#[repr(C)]
pub struct RdmaConnParams {
/// Remote QP number (24 bits, upper 8 bits reserved).
pub remote_qpn: u32,
/// Remote LID (local identifier, for InfiniBand subnets).
pub remote_lid: u16,
/// Service level (QoS, 0-15).
pub service_level: u8,
/// Global routing: remote GID (for RoCE and cross-subnet IB).
pub remote_gid: [u8; 16],
/// Explicit padding: `remote_gid` ends at offset 23; `path_mtu` (u32)
/// needs 4-byte alignment, so 1 byte of padding aligns it to offset 24.
pub _pad1: u8,
/// Path MTU (256, 512, 1024, 2048, 4096 bytes).
pub path_mtu: u32,
/// Retry count for RC transport (0-7).
pub retry_count: u8,
/// RNR (Receiver Not Ready) retry count (0-7).
pub rnr_retry: u8,
/// Private data for application-level connection negotiation (up to 56 bytes).
pub private_data: [u8; 56],
/// Length of valid private data bytes.
pub private_data_len: u8,
/// Tail padding: `private_data_len` occupies offset 86, so one byte brings
/// the struct size to 88 — a multiple of its 4-byte alignment, leaving no
/// implicit compiler padding.
pub _pad: [u8; 1],
}
46.1.4 Topology Export
The kernel exports accelerator and network topology so that collective libraries (NCCL, RCCL, Gloo, etc.) can make optimal routing decisions:
/sys/kernel/isle/topology/
accelerators # List of all accelerators with NUMA/PCIe location
p2p_matrix # NxN matrix of P2P bandwidth between accelerators
rdma_links # RDMA network links with bandwidth/latency
collective_groups # Pre-computed optimal collective groups
$ cat /sys/kernel/isle/topology/p2p_matrix
# Bandwidth in GB/s (0 = no direct path)
# GPU0 GPU1 GPU2 GPU3
GPU0 - 200 12.5 12.5
GPU1 200 - 12.5 12.5
GPU2 12.5 12.5 - 200
GPU3 12.5 12.5 200 -
This is additive — NCCL currently discovers topology through a combination of `nvidia-smi topo`, sysfs traversal, and trial-and-error. A kernel-provided topology file is faster and more reliable.
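A consumer of `p2p_matrix` needs only trivial parsing. This sketch (names are illustrative, not a shipped ISLE library) turns the text format shown above into a bandwidth matrix, treating `-` (self) and `0` (no direct path) as absent links:

```rust
/// Parse the p2p_matrix text format: '#' lines are comments, each data row
/// starts with a device label followed by whitespace-separated bandwidths
/// in GB/s. Returns None for "-" (self) and "0" (no direct path).
fn parse_p2p_matrix(text: &str) -> Vec<Vec<Option<f64>>> {
    text.lines()
        .filter(|l| !l.trim_start().starts_with('#') && !l.trim().is_empty())
        .map(|line| {
            line.split_whitespace()
                .skip(1) // drop the row label, e.g. "GPU0"
                .map(|cell| match cell {
                    "-" | "0" => None,
                    other => other.parse::<f64>().ok(),
                })
                .collect()
        })
        .collect()
}
```

From this matrix a collective library can derive the fast intra-machine cliques (200 GB/s entries) and the slow inter-machine edges (12.5 GB/s) directly, with no probing.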
46.2 Linux Compatibility Layer
46.2.1 DRM/KMS Compatibility
Linux's Direct Rendering Manager (DRM) is the standard GPU kernel interface. Userspace
graphics stacks (Mesa, Vulkan, Wayland compositors) use it via /dev/dri/card* and
/dev/dri/renderD*.
ISLE provides DRM compatibility through isle-compat:
Userspace (Vulkan, OpenGL, CUDA, etc.)
|
| Standard DRM/KMS ioctls
v
isle-compat/src/drm/
|
| Translates DRM ioctls to isle-accel KABI calls
v
isle-accel scheduler + AccelBase/AccelCompute vtable
|
| KABI vtable calls
v
GPU driver (Tier 1, domain-isolated)
DRM ioctls translated to isle-accel operations:
| DRM ioctl | isle-accel equivalent |
|---|---|
| `DRM_IOCTL_MODE_*` | Display subsystem (`AccelDisplayVTable`, not covered here) |
| `DRM_IOCTL_*_CTX_CREATE/DESTROY` | `create_context` / `destroy_context` |
| `DRM_IOCTL_*_GEM_CREATE` | `alloc_device_memory` |
| `DRM_IOCTL_*_GEM_MMAP` | `map_device_memory` |
| `DRM_IOCTL_*_EXECBUFFER` | `submit_commands` (via AccelScheduler) |
| `DRM_IOCTL_*_WAIT` | `poll_completion` |
| `DRM_IOCTL_PRIME_*` | P2P DMA handle export/import |
46.2.2 NVIDIA Compatibility
NVIDIA's userspace stack (CUDA runtime, cuDNN, TensorRT) communicates with the kernel
driver through proprietary ioctls on /dev/nvidia*. The compatibility approach:
Option A: NVIDIA ships a KABI-native kernel interface layer
- NVIDIA already has a clean internal split between their proprietary
compute core and the "kernel interface layer" (nvidia.ko)
- ISLE provides a KABI implementation of this interface layer
- NVIDIA's compute core links against our KABI implementation
- Same approach described in Section 55.4
Option B: ioctl compatibility shim
- Translate NVIDIA's /dev/nvidia* ioctls to isle-accel KABI calls
- More fragile (NVIDIA changes their ioctl interface between releases)
- Stopgap until Option A is available
Option A is strongly preferred and is the medium-term plan.
46.2.3 Feasibility Analysis: Porting Open-Source NVIDIA Drivers
NVIDIA open-sourced their kernel modules (nvidia.ko, nvidia-uvm.ko, nvidia-modeset.ko) in 2022 under dual MIT/GPLv2 license. This section analyzes the feasibility of writing a new ISLE-native driver based on this open-source code, preserving binary compatibility with NVIDIA's proprietary userspace stack (CUDA, cuDNN, TensorRT, etc.).
Why this matters: If unmodified libcuda.so, libnvidia-ml.so, cuDNN, TensorRT,
and the entire CUDA toolkit work on ISLE without recompilation, the kernel is
immediately viable for the entire GPU computing ecosystem.
NVIDIA's Architecture (Post-2022 Open-Source Release)
The modern NVIDIA driver stack has a clean three-layer architecture:
Layer 3: Proprietary Userspace (binary-only, NOT recompiled)
┌─────────────────────────────────────────────────────┐
│ libcuda.so (CUDA runtime) │
│ libnvidia-ml.so (NVML monitoring) │
│ cuDNN, TensorRT, NCCL │
│ Vulkan/OpenGL ICD (libnvidia-glcore.so) │
│ NVENC/NVDEC (video encode/decode) │
└──────────────────────┬──────────────────────────────┘
│ ioctl() on /dev/nvidia*
│ (stable-ish interface, versioned by driver release)
v
Layer 2: Open-Source Kernel Module (MIT/GPLv2, THIS IS WHAT WE PORT)
┌─────────────────────────────────────────────────────┐
│ nvidia.ko │
│ ├── OS interface layer (nv-linux.c, nv-pat.c, │
│ │ nv-mmap.c, nv-i2c.c, nv-acpi.c, etc.) │
│ │ → Linux kernel API calls (PCI, DMA, IRQ, MM) │
│ │ → THIS is what we rewrite for ISLE │
│ │ │
│ ├── RM (Resource Manager) core │
│ │ → Hardware-agnostic resource management │
│ │ → Talks DOWN to GSP firmware via RM RPC │
│ │ → Talks UP to userspace via control ioctls │
│ │ → We can reuse this largely unchanged │
│ │ │
│ └── Entry points: module_init/exit, ioctl dispatch │
│ → Rewrite for KABI driver lifecycle │
│ │
│ nvidia-uvm.ko (Unified Virtual Memory) │
│ ├── Page fault handler, migration engine │
│ ├── Uses Linux HMM (mmu_notifiers) │
│ └── Integrate with isle-core HMM (Section 43) │
│ │
│ nvidia-modeset.ko (Display/KMS) │
│ ├── Mode setting, display management │
│ └── Map to AccelDisplayVTable │
└──────────────────────┬──────────────────────────────┘
│ RM RPC (register read/write commands)
v
Layer 1: GSP Firmware (runs ON the GPU, opaque, NOT our concern)
┌─────────────────────────────────────────────────────┐
│ GPU System Processor firmware │
│ - Loaded by kernel module during initialization │
│ - Handles actual hardware programming │
│ - Context switching, power management, ECC, etc. │
│ - Communicates with kernel via shared memory + IRQ │
│ - Binary blob, but runs on GPU, not on CPU │
│ - We just need to load it and talk RM RPC to it │
└─────────────────────────────────────────────────────┘
Key insight: nvidia.ko is NOT a traditional monolithic driver. On modern GPUs (Turing and newer, i.e., everything from 2018+), the kernel module is primarily a thin translation layer between Linux kernel APIs and the GSP firmware running on the GPU itself. The "intelligence" is in the GSP firmware, not in the kernel module.
Component-by-Component Porting Analysis
Component 1: OS Interface Layer → KABI Translation (Mechanical, ~60% of work)
| Linux API Used | ISLE Equivalent | Difficulty |
|---|---|---|
| `pci_register_driver()` | KABI device registry match + `register_driver()` | Trivial |
| `request_irq()` / `free_irq()` | KABI `request_irq()` / `free_irq()` | Trivial |
| `dma_alloc_coherent()` / `dma_map_*()` | KABI `dma_alloc()` / `dma_map()` | Trivial |
| `ioremap()` / `iounmap()` | KABI `ioremap()` / `iounmap()` | Trivial |
| `alloc_pages()` / `__free_pages()` | KABI `alloc_pages()` | Trivial |
| `mutex_lock()` / `spin_lock()` | KABI mutex / spinlock (or Rust equivalents) | Trivial |
| `timer_setup()` / `schedule_work()` | KABI timer / workqueue equivalents | Easy |
| `pci_enable_msi()` / MSI-X | KABI MSI/MSI-X support | Easy |
| `sysfs_create_group()` | KABI property export (device registry) | Easy |
| ACPI methods (power management) | KABI power state callbacks | Easy |
| `vm_area_struct` / `vm_operations` | KABI memory mapping interface | Medium |
| `mmu_notifier_*()` (for UVM) | ISLE HMM `PageLocationTracker` (Section 43) | Medium |
| `/proc/driver/nvidia/` | `/sys/kernel/isle/accel/` (Section 46.2.5) | Easy |
This is ~90+ files in the open-source nvidia.ko that implement Linux kernel API calls. The work is mechanical: each Linux API call maps to its KABI equivalent. There are no algorithmic decisions — it's pure API translation.
Component 2: ioctl Interface → /dev/nvidia* Compatibility (Critical)
The binary userspace communicates exclusively through ioctls on:
- /dev/nvidia0, /dev/nvidia1, ... (per-GPU control)
- /dev/nvidiactl (system-level control)
- /dev/nvidia-uvm (unified virtual memory)
- /dev/nvidia-uvm-tools (profiling)
- /dev/nvidia-modeset (display)
Approach:
isle-compat provides /dev/nvidia* character devices.
Each ioctl is dispatched to the ISLE NVIDIA driver (Tier 1).
The driver's RM core processes the ioctl exactly as it does on Linux.
The critical point: we do NOT reinterpret ioctls.
We pass them through to the same RM core code (open-source).
The RM core talks to GSP firmware, which does the actual work.
The ioctl ABI is preserved byte-for-byte.
The ioctl numbers and structures are defined in NVIDIA's open-source headers
(nvidia-uvm/uvm_linux_ioctl.h, nvidia/nv-ioctl-numbers.h). Since the RM core is
open-source and handles ioctl dispatch internally, we preserve the exact same ioctl
interface. Binary libcuda.so cannot distinguish ISLE from Linux.
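The first step of that dispatch is decoding the ioctl command word itself, which follows the public Linux asm-generic `_IOC` bit layout (nr in bits 0-7, type in 8-15, size in 16-29, direction in 30-31). The helpers below are a sketch of that standard decode, not ISLE source; after decoding, the request is handed unmodified to the RM core:

```rust
// Linux asm-generic ioctl command layout (public ABI).
const IOC_NRBITS: u32 = 8;
const IOC_TYPEBITS: u32 = 8;
const IOC_SIZEBITS: u32 = 14;

/// Command number within the driver's ioctl space.
fn ioc_nr(cmd: u32) -> u32 {
    cmd & ((1 << IOC_NRBITS) - 1)
}

/// Driver "magic" type byte.
fn ioc_type(cmd: u32) -> u32 {
    (cmd >> IOC_NRBITS) & ((1 << IOC_TYPEBITS) - 1)
}

/// Size of the argument structure copied in/out.
fn ioc_size(cmd: u32) -> u32 {
    (cmd >> (IOC_NRBITS + IOC_TYPEBITS)) & ((1 << IOC_SIZEBITS) - 1)
}

/// Direction bits: 0 = none, 1 = write, 2 = read, 3 = read/write.
fn ioc_dir(cmd: u32) -> u32 {
    cmd >> (IOC_NRBITS + IOC_TYPEBITS + IOC_SIZEBITS)
}
```

Because the size field is encoded in the command word, the compatibility layer can validate the user buffer length before dispatch without knowing anything about the specific NVIDIA ioctl — preserving the ABI byte-for-byte while still bounds-checking every request.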
Component 3: UVM (Unified Virtual Memory) → ISLE HMM Integration (Hardest Part)
nvidia-uvm.ko implements CUDA Unified Memory. On Linux, it hooks deeply into the memory management subsystem:
Linux UVM hooks:
- mmu_notifier (track CPU page table changes)
- hmm_range_fault() (resolve CPU page faults for GPU access)
- migrate_vma() (migrate pages between CPU and GPU)
- Fault handler for GPU page faults (ATS/PRI)
ISLE's advantage: Section 43 (Heterogeneous Memory Management) provides exactly these primitives as first-class kernel features. The mapping is:
| Linux UVM mechanism | ISLE equivalent |
|---|---|
| `mmu_notifier_register()` | `PageLocationTracker` subscription |
| `hmm_range_fault()` | ISLE HMM `handle_device_fault()` callback |
| `migrate_vma_setup/pages/finalize()` | AccelBase `migrate_pages()` |
| `fault_handler()` for ATS | ISLE device fault handler (Section 43.1.4) |
| Per-process GPU page tables | AccelContext memory management |
| UVM counters / access tracking | ISLE memory accounting (cgroup `accel.memory.*`) |
This is the most complex component because UVM deeply interweaves with memory management internals. However, ISLE's HMM design (Section 43) was designed with exactly this use case in mind, so the mapping is cleaner than on Linux.
Component 4: GSP Firmware Loading and Communication (Straightforward)
The kernel module loads GSP firmware from the filesystem (/lib/firmware/nvidia/)
into GPU VRAM, then communicates via shared memory regions and interrupts. This is
self-contained within the RM core and does not depend on Linux APIs beyond basic
DMA and interrupt handling — both of which map trivially to KABI.
Component 5: Display / KMS (nvidia-modeset.ko) (Standard)
Maps to AccelDisplayVTable. Standard DRM/KMS translation (Section 46.2.1) handles most of this. NVIDIA's modeset module is relatively thin compared to the compute path.
Difficulty Rating Summary
| Component | Difficulty | Notes |
|---|---|---|
| OS interface layer rewrite | 3/10 | Mechanical API translation |
| ioctl passthrough (`/dev/nvidia*`) | 2/10 | Character device + dispatch |
| GSP firmware loading | 2/10 | Self-contained in RM core |
| PCI/DMA/IRQ setup | 2/10 | Trivial KABI mapping |
| UVM → ISLE HMM integration | 7/10 | Deepest integration point |
| Display / modeset | 4/10 | Standard DRM/KMS path |
| Power management (ACPI/PCIe) | 3/10 | Device registry PM callbacks |
| Testing + binary userspace validation | 5/10 | Must test full CUDA stack |
Binary Userspace Compatibility Verification
The following must work without recompilation on ISLE:
| Component | Interface Used | Compatibility Path |
|---|---|---|
| `libcuda.so` (CUDA runtime) | `/dev/nvidia*` ioctls | ioctl passthrough to RM core |
| `libnvidia-ml.so` (NVML) | `/dev/nvidiactl` ioctls + sysfs | ioctl passthrough + sysfs compat |
| cuDNN / TensorRT | Links against `libcuda.so` | Transitive (libcuda works → these work) |
| `nvidia-smi` | NVML library | Transitive |
| NCCL (multi-GPU) | libcuda + `/dev/nvidia-uvm` | UVM compat + P2P DMA support |
| Vulkan ICD | DRM + `/dev/nvidia*` | DRM compat (Section 46.2.1) + ioctl passthrough |
| NVENC/NVDEC | `/dev/nvidia*` ioctls | ioctl passthrough |
| Container runtime (nvidia-container-toolkit) | cgroup + device files | cgroup compat + device file compat |
What ISLE Does Better Than Linux for NVIDIA GPUs
| Capability | Linux Behavior | ISLE Behavior |
|---|---|---|
| GPU crash recovery | System reboot required | Driver reload in ~100-500ms (Section 42.3.2) |
| GPU scheduling | Driver-internal, invisible | Kernel-managed, cgroup-integrated (Section 42.2.4) |
| GPU memory limits | None (driver tracks, no enforcement) | cgroup accel.memory.max (Section 44.2) |
| GPU compute QoS | None | cgroup accel.compute.guarantee (Section 44.4) |
| GPU memory in OOM killer | Invisible | Full visibility, OOM-killable (Section 44.3) |
| Multi-tenant isolation | MIG only (hardware-dependent) | Software scheduling + MIG (Section 44) |
| GPU observability | nvidia-smi (polling) | Stable tracepoints + eBPF (Section 42.3.4) |
| UVM performance | Bolted-on HMM, driver-specific | First-class HMM, kernel-managed (Section 43) |
| P2P DMA (GPUDirect) | NVIDIA-specific API | Generalized KABI P2P (Section 43) |
| Power management | Driver-internal | Topology-driven, device-registry-integrated |
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| NVIDIA changes ioctl ABI between releases | Medium | High | Pin to specific driver release initially; NVIDIA's ioctl ABI is versioned and backwards-compatible in practice |
| UVM integration bugs cause data corruption | Low | Critical | Extensive testing with CUDA memory stress tests; UVM has well-defined semantics |
| GSP firmware version incompatibility | Low | High | Support specific firmware versions matching the open-source driver release |
| Performance regression vs Linux | Medium | Medium | Profile early; most performance is in GSP firmware and userspace, not kernel module |
| NVIDIA open-source license compliance | None | N/A | MIT/GPLv2 dual license; our OKLF is compatible with both |
Implementation Strategy
Phase 1: Basic GPU Initialization
- Port OS interface layer to KABI
- GSP firmware loading
- /dev/nvidia* character devices with ioctl passthrough
- PCI, DMA, IRQ setup via KABI
- Goal: nvidia-smi shows GPU info
Phase 2: Compute Workloads
- Full RM core integration
- CUDA simple programs work (vector add, matmul)
- Basic context management through AccelBase
- Goal: CUDA samples compile and run
Phase 3: UVM and Multi-GPU
- UVM integration with ISLE HMM
- Unified memory (cudaMallocManaged) works
- P2P DMA for multi-GPU (NVLink, PCIe)
- NCCL multi-GPU collectives
- Goal: PyTorch distributed training works
Phase 4: Production Hardening
- Display / modeset (AccelDisplayVTable)
- Full cgroup integration
- Crash recovery testing
- Performance parity validation
- Goal: Production-ready for datacenter + desktop
Licensing Note
NVIDIA's open-source kernel modules are dual-licensed MIT/GPLv2. Since we are writing a new driver inspired by the open-source code (not copy-pasting it), and our KABI interface layer is original work, there are no licensing concerns. The ioctl numbers and structures are functional interfaces (facts, not copyrightable expression). The GSP firmware is a binary blob loaded onto the GPU (not linked into our kernel). This is fully compatible with OKLF v1.3.
46.2.4 VFIO Passthrough
For VMs that need direct device access. VFIO is a general-purpose mechanism (see Section 7.3.8) — it works identically for GPUs, NICs, NVMe controllers, and any other PCIe device. The GPU-specific example:
/dev/vfio/ interface (Section 6, Tier 2 driver path)
|
v
isle-kvm (KABI Tier 1 driver)
|
| Programs IOMMU for VM isolation
v
PCIe device (entire device assigned to VM, VM's guest driver manages it)
VFIO passthrough works unchanged from Linux. The device is assigned to the VM at the IOMMU group level (Section 7.3.8).
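The group-granularity rule is worth making concrete: claiming any one device hands the VM the entire IOMMU group, because devices within a group are not DMA-isolated from each other. A minimal Python model of that semantics (all names here are illustrative, not ISLE or VFIO API):

```python
class IommuGroup:
    """Model of an IOMMU group: the smallest unit of DMA isolation."""
    def __init__(self, group_id, devices):
        self.group_id = group_id
        self.devices = list(devices)   # e.g. PCI BDF strings
        self.owner = None              # VM that holds the group, if any

def assign_to_vm(group, device, vm):
    """Passthrough is group-granular: claiming one device claims them all."""
    if device not in group.devices:
        raise ValueError(f"{device} not in group {group.group_id}")
    if group.owner is not None and group.owner != vm:
        raise PermissionError(f"group {group.group_id} owned by {group.owner}")
    group.owner = vm
    # Every device in the group now sits in the VM's IOMMU domain.
    return list(group.devices)

# A GPU and its HDMI audio function often share one group behind a switch.
grp = IommuGroup(7, ["0000:65:00.0", "0000:65:00.1"])
claimed = assign_to_vm(grp, "0000:65:00.0", vm="vm1")
```

Note that `claimed` contains both functions even though only the GPU was requested; a second VM attempting to take the audio function would be refused.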
46.2.5 ISLE-Specific Interfaces (Superset)
Beyond Linux compatibility, new interfaces for ISLE-aware software:
/dev/isle-accel-0 # ISLE-native accelerator access
/dev/isle-accel-1 # (one per accelerator device)
/sys/kernel/isle/accel/
devices/
0/
info # AccelDeviceInfo (JSON-formatted)
utilization # Current utilization stats
contexts # Active context list
memory # Memory usage breakdown
power # Power/thermal/clock state
topology # PCIe/NVLink/xGMI connections
partitions/ # MIG/SR-IOV partitions (if available)
0/
info
bind # Bind to cgroup
scheduler/
policy # Global scheduling policy
stats # Global scheduling statistics
topology/
p2p_matrix # Bandwidth matrix
rdma_links # Network links
numa_map # Accelerator-to-NUMA mapping
inference/
models/ # In-kernel model management (Section 45)
Existing Linux tools (nvidia-smi, rocm-smi, intel_gpu_top) continue to work
through the DRM/sysfs compatibility layer. ISLE-specific tools (isle-accel-top,
isle-gpu-smi) can use the richer /sys/kernel/isle/accel/ interface for more
detailed information and control.
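A tool like isle-accel-top might consume this tree roughly as follows. This is an illustrative Python sketch run against a mock directory standing in for /sys/kernel/isle/accel/; only the path layout and the JSON-formatted info file come from the spec above, and the helper name and field names are invented:

```python
import json
import tempfile
from pathlib import Path

def read_device_info(accel_root, dev_id):
    """Read the per-device 'info' file (JSON-formatted per Section 46.2.5)."""
    info_path = Path(accel_root) / "devices" / str(dev_id) / "info"
    return json.loads(info_path.read_text())

# Illustration with a mock tree in place of /sys/kernel/isle/accel/:
root = Path(tempfile.mkdtemp())
(root / "devices" / "0").mkdir(parents=True)
(root / "devices" / "0" / "info").write_text(
    json.dumps({"name": "example-gpu", "vram_bytes": 17179869184}))

info = read_device_info(root, 0)
```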
46.2.6 Display Stack: Wayland and Buffer Sharing
The modern Linux display stack centers on Wayland compositors consuming DRM/KMS (Section 46.2.1). ISLE provides the full kernel infrastructure these compositors require, with crash-recovery advantages that Linux cannot offer.
DMA-BUF (Cross-Device Buffer Sharing)
DMA-BUF is the kernel primitive for sharing memory buffers between devices — GPU to display controller, GPU to video encoder, camera to GPU, GPU to network (RDMA). In Linux, DMA-BUF is a global namespace with weak access control. In ISLE, each DMA-BUF is a capability-protected kernel object:
DMA-BUF lifecycle:
1. Exporter (e.g., GPU driver) creates a DMA-BUF from device memory
2. Returns a capability token (not a raw file descriptor)
3. Importer (e.g., display driver) receives the capability via IPC
4. Importer maps the DMA-BUF into its device's address space
5. Zero-copy path: both devices access the same physical memory
For Linux compatibility, DMA-BUF capabilities are presented as file descriptors through the compat layer, so existing userspace (Mesa, Wayland compositors, GStreamer) works unmodified.
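The lifecycle above can be modeled compactly. A hedged Python sketch of the capability semantics and the fd-presentation shim (class and function names are hypothetical; the real objects live in isle-core):

```python
import itertools

class DmaBufRegistry:
    """Model of capability-protected DMA-BUFs: tokens are unforgeable
    handles handed out by the exporter and passed to importers via IPC."""
    def __init__(self):
        self._bufs = {}                      # token -> buffer metadata
        self._tokens = itertools.count(1)

    def export(self, exporter, size, fmt):
        token = next(self._tokens)
        self._bufs[token] = {"exporter": exporter, "size": size,
                             "format": fmt, "importers": set()}
        return token                         # a capability, not a raw fd

    def import_buf(self, token, importer):
        """Map the buffer into the importer's device address space."""
        meta = self._bufs[token]             # KeyError = invalid capability
        meta["importers"].add(importer)
        return meta["size"]                  # zero-copy: same physical pages

def compat_wrap_as_fd(fd_table, token):
    """Linux-compat layer: install the capability in a per-process fd
    table so existing userspace sees an ordinary file descriptor."""
    fd = len(fd_table)
    fd_table[fd] = token
    return fd

reg = DmaBufRegistry()
tok = reg.export("gpu0", size=4096 * 2160 * 4, fmt="XRGB8888")
size = reg.import_buf(tok, "display0")
fd_table = {}
fd = compat_wrap_as_fd(fd_table, tok)
```

Presenting a forged or stale token fails at lookup time, which is the property that distinguishes this design from Linux's globally shareable DMA-BUF fds.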
Explicit Synchronization (dma_fence / sync_file)
GPU and display operations are asynchronous. Explicit sync primitives coordinate them:
- dma_fence: kernel-internal synchronization point representing GPU work completion. Created by the GPU driver when work is submitted, signaled when the GPU finishes.
- sync_file: userspace-visible wrapper around one or more dma_fences. Exported as a file descriptor. Wayland compositors use these to know when a client's rendering is complete before presenting.
- Timeline semaphores: Vulkan-style monotonic sync objects. More efficient than binary fences for pipelined workloads (render frame N while displaying frame N-1).
ISLE implements the Linux-compatible sync_file ioctl interface
(SYNC_IOC_MERGE, SYNC_IOC_FILE_INFO) so existing Wayland compositors and Vulkan
drivers work without modification.
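The relationship between dma_fence, sync_file, and merging can be captured in a few lines. A simplified Python model of the SYNC_IOC_MERGE semantics (signaling here is immediate rather than driven by GPU interrupts, and the names are illustrative):

```python
class Fence:
    """Model of a dma_fence: signaled exactly once, when work completes."""
    def __init__(self):
        self.signaled = False
    def signal(self):
        self.signaled = True

class SyncFile:
    """Model of a sync_file wrapping one or more fences. The merged file
    is signaled only when ALL constituent fences have signaled."""
    def __init__(self, fences):
        self.fences = list(fences)
    @property
    def signaled(self):
        return all(f.signaled for f in self.fences)
    def merge(self, other):
        return SyncFile(self.fences + other.fences)

render_done, upload_done = Fence(), Fence()
frame = SyncFile([render_done]).merge(SyncFile([upload_done]))
render_done.signal()
mid = frame.signaled      # still False: the upload fence is pending
upload_done.signal()      # all fences signaled; compositor may present
```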
GBM (Generic Buffer Manager)
GBM is the userspace buffer allocation library that Wayland compositors use to allocate
scanout-capable buffers. It talks to DRM via ioctl. ISLE's DRM compatibility layer
(Section 46.2.1) ensures GBM works unmodified — gbm_create_device(), gbm_bo_create(),
and related functions issue standard DRM ioctls that ISLE handles.
Render Nodes (/dev/dri/renderD*)
Render nodes provide unprivileged GPU compute and render access without modesetting capability. Any user can open a render node to submit GPU work (3D rendering, compute shaders, video encode/decode) without needing root or DRM master status. ISLE exposes render nodes via the DRM compat layer, mapping each to an isle-accel context creation with appropriate capability restrictions (no modesetting, no display control).
DRM Leases
DRM leases allow a client to take exclusive control of a display connector — the
primary consumer is VR headsets (SteamVR uses DRM leases to drive the HMD display
independently from the desktop compositor). ISLE supports the lease ioctl family
(DRM_IOCTL_MODE_CREATE_LEASE, DRM_IOCTL_MODE_LIST_LESSEES, etc.) through the DRM
compatibility layer.
KMS Atomic Modesetting
The modern display pipeline API. Replaces legacy drmModeSetCrtc with atomic commits
that update multiple display properties (resolution, refresh, gamma, position) as a
single transaction — either the entire commit succeeds or nothing changes. ISLE's DRM
layer implements the full atomic commit interface (DRM_IOCTL_MODE_ATOMIC) including:
- TEST_ONLY mode (compositor can test configurations without applying them)
- Non-blocking commits (compositor doesn't stall waiting for vblank)
- Property-based interface (all display state expressed as properties)
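The all-or-nothing transaction and TEST_ONLY probing can be sketched as follows. This is a Python model of the semantics, not the ioctl itself; the validator set is illustrative, using two real KMS property names:

```python
class AtomicCommitError(Exception):
    pass

def atomic_commit(state, updates, validators, test_only=False):
    """Model of DRM_IOCTL_MODE_ATOMIC semantics: validate every property
    update first, then apply all of them or none. TEST_ONLY validates
    without applying, so a compositor can probe configurations."""
    for prop, value in updates.items():
        check = validators.get(prop)
        if check is not None and not check(value):
            raise AtomicCommitError(f"invalid {prop}={value!r}")
    if test_only:
        return state                 # nothing changes
    new_state = dict(state)
    new_state.update(updates)        # the whole transaction lands at once
    return new_state

validators = {"MAX_BPC": lambda v: v in (8, 10, 12, 16)}
state = {"MAX_BPC": 8, "COLORSPACE": "BT.709"}

# Probe without applying, then commit for real:
probed = atomic_commit(state, {"MAX_BPC": 10}, validators, test_only=True)
state2 = atomic_commit(state, {"MAX_BPC": 10, "COLORSPACE": "BT.2020"},
                       validators)
```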
Multi-GPU Display (PRIME)
Render on GPU A, scanout on GPU B via DMA-BUF sharing (known as PRIME in the DRM ecosystem). The P2P DMA infrastructure (Section 43) handles the underlying data transfer. For the display case specifically:
- DMA-BUF export from render GPU → DMA-BUF import on display GPU
- If GPUs share a PCIe switch, transfer is direct P2P (no CPU copy)
- If GPUs are on different NUMA nodes, transfer goes through host memory
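The path choice reduces to a small decision function. A Python sketch; the topology encoding is invented for illustration (the kernel derives the real answer from PCIe/NVLink topology, Section 43):

```python
def prime_transfer_path(render_gpu, display_gpu, topology):
    """Choose the PRIME transfer path between two GPUs.
    `topology` maps an unordered GPU pair to its connectivity."""
    link = topology.get(frozenset((render_gpu, display_gpu)))
    if link == "same_pcie_switch":
        return "direct-p2p"      # no CPU copy
    return "host-bounce"         # staged through host memory

topo = {frozenset(("gpu0", "gpu1")): "same_pcie_switch",
        frozenset(("gpu0", "gpu2")): "cross_numa"}
```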
HDR and Wide Color Gamut
Modern displays support HDR10, Dolby Vision, and wide color gamuts (DCI-P3, Rec. 2020).
Kernel support is exposed via KMS atomic properties:
- HDR_OUTPUT_METADATA: HDR static/dynamic metadata per connector
- COLORSPACE: signal color space to the display (BT.709, DCI-P3, BT.2020)
- MAX_BPC: maximum bits-per-channel for the connector (8, 10, 12, 16)
- COLOR_ENCODING / COLOR_RANGE: YCbCr encoding and quantization range
These are standard KMS properties exposed through atomic modesetting. Wayland compositors (wlroots, Mutter, KWin) use these to negotiate HDR output with the display.
Crash Recovery Advantage
If the DRM/GPU driver crashes (Tier 1 reload, Section 9):
- The driver binary is reloaded in ~50-150ms
- DMA-BUF metadata (size, format, sharing graph, capability tokens) survives the reload — this metadata is managed by isle-core's memory subsystem, not the driver. Buffer handles remain valid. However, VRAM contents are lost on device reset — a GPU reset physically reinitializes the device memory, destroying all framebuffer data, texture contents, and compute buffers stored in VRAM. Applications must re-upload data to the GPU after recovery.
- Sync objects are re-created in signaled state (conservative — forces resubmission of any pending work, but doesn't deadlock the compositor)
- The Wayland compositor sees a brief stall (~100-500ms, the full recovery window) and must re-render its buffers (VRAM contents are lost), but does not lose its buffer handles or capability tokens, does not crash, and does not need to re-negotiate with clients. The compositor's CPU-side state (window tree, client list, layout) is fully preserved.
- Running applications still hold valid DMA-BUF capabilities. They receive an ENODATA error on the next access to a VRAM-backed buffer, indicating that the buffer contents are stale and must be re-uploaded. Applications that maintain CPU-side copies of their rendering data (the common case for double-buffered Wayland clients) can re-upload and resume rendering immediately. Applications that relied solely on VRAM-resident data with no CPU-side copy must regenerate or reload from disk.
In Linux, a GPU driver crash typically kills the entire X/Wayland session, requiring all graphical applications to restart. ISLE's display stack crash recovery preserves the sharing graph and buffer metadata, allowing rapid re-rendering rather than full session teardown. This is a direct consequence of the capability-managed DMA-BUF design and Tier 1 driver reload.