Sections 5–9 of the ISLE Architecture. For the full table of contents, see README.md.
Part II: Driver Architecture
The driver model, isolation tiers, device management, I/O paths, and crash recovery.
5. Driver Model and Stable ABI (KABI)
5.1 The Problem We Solve
Linux has NO stable in-kernel ABI. This means:
- Drivers must recompile with every kernel update
- Nvidia ships binary blobs that constantly break
- DKMS rebuilds are fragile and frequently fail across kernel upgrades
- The community's answer is "get upstream or suffer"
- Enterprise customers cannot independently update kernel and drivers
ISLE provides a stable, versioned, append-only C-ABI (called KABI) that survives kernel updates. A driver compiled against KABI v1 will load and run correctly on any future kernel that supports KABI v1 -- without recompilation.
5.2 Interface Definition Language (.kabi)
All driver interfaces are defined in .kabi IDL files. The kabi-compiler tool
generates both Rust and C bindings from these definitions.
// interfaces/block_device.kabi
@version(1)
interface BlockDevice {
fn submit_io(op: IoOp, lba: u64, count: u32, buf: DmaBuffer) -> IoResult;
fn poll_completion(handle: RequestHandle) -> PollResult;
fn get_capabilities() -> BlockCapabilities;
}
@version(2) @extends(BlockDevice, 1)
interface BlockDeviceV2 {
fn discard_blocks(lba: u64, count: u32) -> IoResult;
fn zone_management(op: ZoneOp, zone: u64) -> ZoneResult;
}
This compiles down to a C-compatible vtable:
#[repr(C)]
pub struct BlockDeviceVTable {
pub vtable_size: u64, // Primary version discriminant
pub version: u32,
// V1 methods -- mandatory, never Option
pub submit_io: unsafe extern "C" fn(
ctx: *mut c_void, op: IoOp, lba: u64, count: u32, buf: DmaBuffer,
) -> IoResult,
pub poll_completion: unsafe extern "C" fn(
ctx: *mut c_void, handle: RequestHandle,
) -> PollResult,
pub get_capabilities: unsafe extern "C" fn(
ctx: *mut c_void,
) -> BlockCapabilities,
// V2 methods -- optional, wrapped in Option for graceful absence
pub discard_blocks: Option<unsafe extern "C" fn(
ctx: *mut c_void, lba: u64, count: u32,
) -> IoResult>,
pub zone_management: Option<unsafe extern "C" fn(
ctx: *mut c_void, op: ZoneOp, zone: u64,
) -> ZoneResult>,
}
5.3 ABI Rules (Enforced by CI)
These rules are non-negotiable and enforced by the kabi-compat-check tool in CI:
- Vtables are append-only -- new methods are added at the end only.
- Existing methods are never removed, reordered, or changed in signature.
- All types crossing the ABI use `#[repr(C)]` with explicit sizes (`u32`, `u64`; never `usize`, which varies by platform).
- Enums use `#[repr(u32)]` with explicit discriminant values.
- New struct fields are appended only, never removed or reordered.
- The `vtable_size` field enables runtime version detection. A kernel can determine which methods are present by comparing `vtable_size` against known offsets.
- Padding fields are reserved and must be zero-initialized for forward compatibility.
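The `vtable_size` rule can be sketched concretely. In this hedged example the method fields are stand-ins (the real vtables come from kabi-compiler output, and a real consumer would only read bytes up to the producer's declared `vtable_size`); the point is the presence check: a v2 method is callable only if the vtable extends past the v1 layout and the `Option` slot is non-null.

```rust
use std::mem::size_of;
use std::os::raw::c_void;

// Layout of the v1 vtable -- used only to compute the v1 size.
#[repr(C)]
pub struct BlockDeviceVTableV1 {
    pub vtable_size: u64,
    pub version: u32,
    pub submit_io: usize,        // stand-ins for the mandatory v1 fn pointers
    pub poll_completion: usize,
    pub get_capabilities: usize,
}

// Full (v2) layout: the v1 prefix is unchanged, new methods are appended.
#[repr(C)]
pub struct BlockDeviceVTable {
    pub vtable_size: u64,
    pub version: u32,
    pub submit_io: usize,
    pub poll_completion: usize,
    pub get_capabilities: usize,
    // Appended in v2 -- Option<fn> so absence is represented by None (null).
    pub discard_blocks: Option<unsafe extern "C" fn(*mut c_void, u64, u32) -> i32>,
}

/// A v2 method is present iff the producer's vtable covers the v2 layout
/// AND the slot is non-null. The size check must come first (short-circuit)
/// so a v1 producer's vtable is never read past its declared size.
pub fn has_discard(vt: &BlockDeviceVTable) -> bool {
    vt.vtable_size as usize >= size_of::<BlockDeviceVTable>() && vt.discard_blocks.is_some()
}
```

An old driver exposing only the v1 layout reports a smaller `vtable_size`, so the kernel simply never calls `discard_blocks` on it.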
5.3a KABI Version Lifecycle and Deprecation Policy
Append-only vtables ensure forward compatibility indefinitely — a driver compiled against KABI v1 runs on a kernel implementing KABI v47. But without a deprecation policy, vtables grow without bound, accumulating dead methods that no driver uses, wasting cache lines, and complicating auditing. This section defines the lifecycle.
Version numbering — KABI versions are integer-incremented (v1, v2, v3...). Each version corresponds to a vtable layout. A new version is minted when methods are appended or struct fields are added (never removed or reordered). Major kernel releases bump the KABI version; minor releases do not.
Support window — each KABI version is supported for 5 major kernel releases from the release that introduced it. This provides a concrete, predictable window:
- KABI v1: introduced in ISLE 1.0 → deprecation announced in ISLE 4.0 → supported through ISLE 5.x → shim removed in ISLE 6.0
- KABI v5: introduced in ISLE 5.0 → deprecation announced in ISLE 8.0 → supported through ISLE 9.x → shim removed in ISLE 10.0
Deprecation process:
- Deprecation announcement (N-2 releases before removal): KABI v1 is marked deprecated when ISLE 4.0 ships. Loading a driver built against a deprecated KABI version logs a warning:
  isle: driver nvme.ko uses deprecated KABI v1 (supported until ISLE 5.x, rebuild recommended)
- Compatibility shim (during deprecation window): deprecated vtable methods are backed by shim implementations that translate old calls to current equivalents. This is a vtable-level adapter, not per-call overhead.
- Removal (at window expiry): when ISLE 6.0 ships, the KABI v1 compatibility shim is removed. Drivers compiled against KABI v1 fail to load with a clear error:
  isle: driver nvme.ko requires KABI v1 (minimum supported: v2)
- Never break within window: a driver compiled against any supported KABI version must load and function correctly. This is a hard contract, verified by CI testing with driver binaries compiled against every supported KABI version.
Vtable compaction — when a KABI version is removed, the kernel MAY reorganize internal vtable storage to reclaim space from removed shims. This is invisible to drivers (they see only their own KABI version's vtable layout, which never changes within the support window). Compaction is an implementation optimization, not a semantic change.
Practical impact — with annual major releases and a 5-release window, drivers have ~5 years before they must recompile. This is dramatically longer than Linux's "recompile every kernel update" reality, while avoiding the "append forever" problem.
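The window arithmetic above is mechanical and worth pinning down: with a 5-release window and an N-2 announcement, a version introduced in major release N is announced deprecated at N+3 and its shim removed at N+5. A minimal sketch (function name is illustrative, not part of the KABI):

```rust
/// Lifecycle of a KABI version under the 5-major-release support window:
/// returns (deprecation-announcement major, shim-removal major).
/// Removal happens 5 majors after introduction; the announcement is
/// N-2 releases before removal, i.e. 3 majors after introduction.
pub fn kabi_lifecycle(introduced_major: u32) -> (u32, u32) {
    let removal = introduced_major + 5;
    let announce = removal - 2;
    (announce, removal)
}
```

This reproduces the examples in this section: v1 (introduced in 1.0) is announced deprecated in 4.0 and removed in 6.0; v5 is announced in 8.0 and removed in 10.0.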
5.4 Bilateral Capability Exchange
Unlike Linux's global kernel symbol table (EXPORT_SYMBOL), ISLE uses a bilateral
vtable exchange model. There are no global symbols, no symbol versioning, and no
uncontrolled dependencies.
Driver Loading Sequence:
1. Kernel resolves ONE well-known symbol: __kabi_driver_entry
2. Kernel passes KernelServicesVTable TO driver
(this is what the kernel provides to the driver)
3. Driver passes DriverVTable TO kernel
(this is what the driver provides to the kernel)
4. All further communication flows through these two vtables
5. No other symbols are resolved -- ever
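The loading sequence can be sketched from the driver's side. This is a hedged illustration: the two vtable structs are truncated stand-ins (only the mandatory header fields are shown), the error codes are hypothetical, and a real driver would export the symbol with `#[no_mangle]` (omitted so the sketch stays a plain Rust function).

```rust
use std::mem::size_of;

pub const KABI_OK: i32 = 0;
pub const KABI_ERR_VERSION_MISMATCH: i32 = -1; // hypothetical error code
const MIN_KERNEL_VERSION: u32 = 1; // oldest KernelServicesVTable this driver understands

// Truncated stand-ins: real vtables carry method pointers after the header.
#[repr(C)]
pub struct KernelServicesVTable { pub vtable_size: u64, pub version: u32 /* , methods... */ }
#[repr(C)]
pub struct DriverVTable { pub vtable_size: u64, pub version: u32 /* , methods... */ }

// What this driver provides to the kernel (step 3 of the sequence).
static DRIVER_VTABLE: DriverVTable = DriverVTable {
    vtable_size: size_of::<DriverVTable>() as u64,
    version: 1,
};

/// The single well-known entry symbol: kernel passes its services vtable in,
/// driver passes its own vtable out. No other symbols are ever resolved.
pub unsafe extern "C" fn __kabi_driver_entry(
    kernel: *const KernelServicesVTable,
    driver_out: *mut *const DriverVTable,
) -> i32 {
    if kernel.is_null() || driver_out.is_null() {
        return KABI_ERR_VERSION_MISMATCH;
    }
    // Driver-side version check (Section 5.5, step 2).
    if (*kernel).version < MIN_KERNEL_VERSION {
        return KABI_ERR_VERSION_MISMATCH;
    }
    // Hand the driver's vtable back to the kernel.
    *driver_out = &DRIVER_VTABLE;
    KABI_OK
}
```

The kernel then performs its own symmetric check on `DRIVER_VTABLE.version` before wiring the driver into the I/O path.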
The KernelServicesVTable is also versioned and append-only:
#[repr(C)]
pub struct KernelServicesVTable {
pub vtable_size: u64,
pub version: u32,
// Memory management.
// All sizes use u64, not usize, to maintain ABI stability across
// 32-bit (ARMv7, PPC32) and 64-bit targets (rule 3, Section 5.3).
pub alloc_dma_buffer: unsafe extern "C" fn(
size: u64, align: u64, flags: AllocFlags,
) -> AllocResult,
pub free_dma_buffer: unsafe extern "C" fn(
handle: DmaBufferHandle,
) -> FreeResult,
// Interrupt management
pub register_interrupt: unsafe extern "C" fn(
irq: u32, handler: InterruptHandler, ctx: *mut c_void,
) -> IrqResult,
pub deregister_interrupt: unsafe extern "C" fn(
irq: u32,
) -> IrqResult,
// Logging
pub log: unsafe extern "C" fn(
level: u32, msg: *const u8, len: u32,
),
// Ring buffer creation (added in v2)
pub create_ring_buffer: Option<unsafe extern "C" fn(
entries: u32, entry_size: u32, flags: RingFlags,
) -> RingResult>,
// ... extends over time, always append-only ...
}
5.5 Version Negotiation
When a driver loads, version negotiation proceeds as follows:
1. Driver calls __kabi_driver_entry(kernel_vtable, &driver_vtable)
2. Driver reads kernel_vtable.version:
- If kernel version >= driver's minimum required version: proceed
- If kernel version < driver's minimum: return KABI_ERR_VERSION_MISMATCH
3. Kernel reads driver_vtable.version:
- If driver version >= kernel's minimum for this interface: proceed
- If driver version < kernel's minimum: reject driver with log message
4. Both sides use vtable_size to detect which optional methods are present
5. Optional methods (Option<fn>) are checked before each call
This allows:
- Old drivers on new kernels (new kernel methods are simply not called)
- New drivers on old kernels (driver checks for method presence, degrades gracefully)
- Independent kernel and driver update cycles
See also: Section 51 (Safe Kernel Extensibility) extends the KABI vtable pattern to kernel policy modules, enabling hot-swappable scheduler policies, memory policies, and fault handlers using the same append-only ABI mechanism.
6. Driver Isolation Tiers
Tier Classification
| Property | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Location | In-kernel, statically linked | Ring 0, dynamically loaded | Ring 3, separate process |
| Isolation | None (part of core) | Hardware memory domains + IOMMU (see table below) | Full address space + IOMMU |
| Crash behavior | Kernel panic | Reload module (~50-150ms) | Restart process (~10ms) |
| DMA access | Unrestricted | IOMMU-fenced | IOMMU-fenced |
| Performance | Zero overhead | ~23 cycles bare switch + ~10-15 cycles marshaling (x86 MPK; varies by arch) | ~200-500 cycles per crossing |
| Trust level | Maximum (core kernel) | High (verified, signed) | Low (untrusted acceptable) |
| Examples | APIC, timer, early console | NVMe, NIC, TCP/IP, FS, GPU, KVM | USB, audio, input, BT, WiFi |
Tier 1 isolation mechanism per architecture:
The "hardware memory domains" used for Tier 1 isolation are architecture-specific. Not all architectures have a fast isolation mechanism; those without one fall back to page-table-based isolation or Tier 2 demotion. See Section 3 ("Per-Architecture Isolation Cost Analysis") for cycle costs and Section 3 ("Adaptive Isolation Policy") for fallback behavior.
| Architecture | Tier 1 Mechanism | Switch Cost | Domains | Availability |
|---|---|---|---|---|
| x86-64 | MPK (WRPKRU) | ~23 cycles | 12 usable | Intel Skylake+ / AMD Zen 3+ |
| x86-64 (fallback) | Page table + ASID | ~200-400 cycles | Unlimited | All x86-64 |
| AArch64 | POE (MSR POR_EL0 + ISB) | ~40-80 cycles | 7 usable | FEAT_S1POE, optional from ARMv8.9/ARMv9.4 |
| AArch64 (fallback) | Page table + ASID | ~150-300 cycles | Unlimited | All AArch64 |
| ARMv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable | All ARMv7 (universal) |
| RISC-V 64 | (none with paging enabled) | N/A | N/A | No fast mechanism exists |
| RISC-V 64 (fallback) | Page table + ASID | ~200-500 cycles | Unlimited | All RISC-V 64 |
| PPC32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | All PPC32 |
| PPC64LE (POWER9+) | Radix PID (mtspr PIDR) | ~30-60 cycles | Process-scoped | POWER9+ with Radix MMU |
| PPC64LE (POWER8) | HPT + LPAR | ~200-400 cycles | Unlimited | POWER8 (fallback) |
On RISC-V 64, Tier 1 drivers always use the page-table fallback (equivalent to Tier 2
overhead) because no fast intra-address-space isolation mechanism exists with paging
enabled. The isolation=performance mode (Section 3) can promote Tier 1 to Tier 0 on
RISC-V to avoid this overhead, at the cost of losing memory isolation.
Tier 0: Boot-Critical Drivers
These are statically linked into ISLE Core and cannot be isolated because they are required before the isolation infrastructure is available:
- Local APIC and I/O APIC
- PIT/HPET/TSC timer
- Early serial/VGA console
- ACPI table parsing (early boot only). Security trade-off: ACPI tables are firmware-provided data that the kernel must trust at boot. A malicious or buggy BIOS can supply corrupt ACPI tables (malformed AML, overlapping MMIO regions, impossible NUMA topologies). ISLE's Tier 0 ACPI parser performs defensive parsing: all table lengths are bounds-checked, AML interpretation uses a sandboxed evaluator with a cycle limit (no infinite loops), and MMIO regions claimed by ACPI are validated against the e820/UEFI memory map before being mapped. Despite these defenses, ACPI parsing remains the largest attack surface in Tier 0. The firmware quirk framework (Section 7.11.4a) provides per-platform overrides for known-buggy tables.
Tier 0 code is held to the highest review standard and kept minimal.
Tier 1: Kernel-Adjacent Drivers (Hardware Memory Domain Isolated)
Performance-critical drivers run in Ring 0 but are isolated via hardware memory domains (MPK on x86-64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE, POE on AArch64 when available -- see "Tier 1 isolation mechanism per architecture" table above). Each driver (or driver group) is assigned a protection domain. The driver can only access:
- Its own private memory (tagged with its domain key)
- Shared ring buffers (tagged with the shared domain, read-write)
- Shared DMA buffers (tagged with DMA domain, read-write)
- Its MMIO regions (mapped with its domain key)
It cannot access:
- ISLE Core private memory
- Other Tier 1 drivers' private memory
- Page tables, capability tables, or scheduler state
- Arbitrary physical memory
Security limitation: Tier 1 isolation protects against bugs, not exploitation.
On x86-64, MPK isolation uses the WRPKRU instruction, which is unprivileged --
any Ring 0 code (including Tier 1 driver code) can execute it to modify its own
domain permissions and access any MPK-protected memory, including ISLE Core (PKEY 0).
This means a compromised Tier 1 driver with arbitrary code execution can trivially
bypass MPK isolation. On ARMv7, MCR to DACR is privileged (PL1), which is stronger
-- user-space cannot forge domain switches, but kernel-mode drivers still can. On
PPC32 and PPC64LE, segment register and AMR updates are similarly supervisor-mode.
Tier 1 threat model: MPK (and its architectural equivalents) provides defense against accidental corruption -- buffer overflows, use-after-free, null dereferences that happen to write to the wrong address. It does not defend against deliberate exploitation where an attacker achieves arbitrary code execution within a Tier 1 driver and intentionally escapes the domain. For the exploitation case, Tier 2 (full process isolation in Ring 3) is the appropriate boundary.
Tier 1 trust requirement: Tier 1 drivers run in Ring 0 with only domain isolation (not address space isolation). They must be treated as trusted code: cryptographically signed, manifest-verified (Section 4), and subject to the same security review standard as Core kernel code. Tier 1 is not appropriate for third-party, untrusted, or unaudited drivers. Untrusted drivers must use Tier 2 (Ring 3 process isolation) where a compromised driver cannot escalate to kernel privilege regardless of the exploit technique. See Section 6.4.2 (Signal Delivery Across Isolation Boundaries) for the complete domain crossing specification during signal handling.
Mitigations that raise the bar for exploitation are detailed in Section 3
("WRPKRU Threat Model: Unprivileged Domain Escape"): binary scanning for unauthorized
WRPKRU/XRSTOR instructions at load time, W^X enforcement on driver code pages,
forward-edge CFI (Clang -fsanitize=cfi-icall), and the NMI watchdog for detecting
PKRU state mismatches.
Future: PKS (Protection Keys for Supervisor) -- Intel's PKS extension provides
supervisor-mode protection keys that are controlled via MSR writes (privileged
operations that require Ring 0 + CPL 0 MSR access). Unlike WRPKRU (which any Ring 0
code can execute), PKS key modifications go through WRMSR to IA32_PKS, which can
be trapped by a hypervisor or controlled by isle-core. When PKS-capable hardware is
available, ISLE will use PKS for Tier 1 isolation, closing the unprivileged-WRPKRU
escape path. PKS is available on Intel Sapphire Rapids and later server CPUs.
Protection Key Exhaustion (Hardware Domain Limit)
Intel MPK provides only 16 protection keys (PKEY 0-15). With PKEY 0 reserved for ISLE Core, PKEY 1 for shared read-only descriptors, PKEY 14 for shared DMA, and PKEY 15 as guard, only 12 keys (PKEY 2-13) are available for Tier 1 driver domains (see Section 3, "MPK Domain Allocation"). This limits the number of independently isolated Tier 1 drivers to 12 on x86-64 with MPK. Architectures with equivalent mechanisms (AArch64 POE: 7 usable domains, ARMv7 DACR: 15 usable domains, PPC32 segments: 15 usable) face the same constraint. This is a hard hardware limit that cannot be worked around without changing the isolation granularity. PPC64LE (Radix PID) and RISC-V (page-table fallback) use process-scoped isolation without a fixed small domain budget, so domain exhaustion does not apply to those architectures — but they pay higher per-switch costs (see Section 3 isolation cost table).
When domains are exhausted (more concurrent Tier 1 drivers than available hardware domains — 12 on x86 MPK, 7 on AArch64 POE, 15 on ARMv7 DACR, 15 on PPC32 segments), ISLE applies three strategies in priority order:
1. Domain grouping (default): Related drivers share a protection key. For example, all block storage drivers (NVMe, AHCI, virtio-blk) share one key, all network drivers (NIC, TCP/IP stack) share another. Grouping reduces isolation granularity -- a bug in one block driver can corrupt another block driver's memory within the same group -- but preserves isolation between groups (network cannot corrupt storage). Grouping policy is configurable via the driver manifest:

   [driver.isolation]
   isolation_group = "block"  # Share isolation domain with other "block" group drivers

2. Automatic Tier 2 demotion: Drivers below a configurable priority threshold are demoted to Tier 2 (process isolation) when all hardware isolation domains are consumed. Only the most performance-critical drivers retain Tier 1 placement. The priority is determined by `match_priority` in the driver manifest -- higher priority retains Tier 1.

3. Domain virtualization (future): On context switch, the scheduler can save and restore the isolation domain register (PKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, segment registers on PPC32) along with a remapped domain assignment table, allowing more logical domains than hardware provides by time-multiplexing physical domains. Domain virtualization adds overhead to context switches (~50-100 cycles for the register save/restore and domain table lookup) and is used only when strategies 1 and 2 are insufficient. This is a future optimization -- domain grouping and Tier 2 demotion handle all current deployment scenarios.
Practical impact: A typical server has 5-8 performance-critical driver types (NVMe, NIC, TCP/IP, filesystem, GPU, KVM, virtio, crypto). With grouping, these fit within the hardware domain budget on x86 (12 domains), ARMv7 (15), and PPC32 (15) with room to spare. On AArch64 with POE (7 domains), a typical 5-8 driver configuration requires at least one grouping (e.g., NVMe + filesystem share a domain). Systems with unusually many distinct Tier 1 drivers (e.g., multi-vendor NIC + storage + GPU + FPGA configurations) trigger Tier 2 demotion for the lowest-priority drivers.
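Strategies 1 and 2 compose into a simple placement policy, sketched below. This is an illustrative model only: the type names (`DomainAllocator`, `Placement`) are hypothetical, and the real policy engine demotes by `match_priority` rather than by arrival order as this simplification does.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum Placement {
    Tier1 { pkey: u8 },
    Tier2Demoted,
}

/// Hypothetical allocator modeling domain grouping + Tier 2 demotion.
pub struct DomainAllocator {
    budget: usize,               // e.g. 12 on x86 MPK, 7 on AArch64 POE
    groups: HashMap<String, u8>, // isolation_group -> assigned protection key
    next_key: u8,                // PKEY 2-13 are the driver keys on x86 MPK
}

impl DomainAllocator {
    pub fn new(budget: usize) -> Self {
        DomainAllocator { budget, groups: HashMap::new(), next_key: 2 }
    }

    pub fn place(&mut self, isolation_group: &str) -> Placement {
        // Strategy 1: drivers in the same isolation_group share one key.
        if let Some(&k) = self.groups.get(isolation_group) {
            return Placement::Tier1 { pkey: k };
        }
        // Fresh group: allocate a new key while the hardware budget allows.
        if self.groups.len() < self.budget {
            let k = self.next_key;
            self.next_key += 1;
            self.groups.insert(isolation_group.to_string(), k);
            return Placement::Tier1 { pkey: k };
        }
        // Strategy 2: budget exhausted -> demote to Tier 2 process isolation.
        Placement::Tier2Demoted
    }
}
```

With a budget of 12 (x86 MPK), a typical 5-8 group server configuration never reaches the demotion branch; the AArch64 POE budget of 7 is where grouping starts doing real work.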
Long-term trajectory: the domain budget pressure diminishes as devices become
peers — but this is a multi-year ecosystem shift, not a near-term fix. The
devices that consume the most Tier 1 domain slots today — GPU (~700K lines of
driver code), high-end NIC/DPU (~150K lines), and high-throughput storage
controllers — are exactly the devices most suited to become ISLE multikernel
peers (§47.2.2). When a device runs its own ISLE kernel and participates as a
cluster peer, it is handled entirely by isle-peer-transport (~2K lines) and
consumes zero MPK domains; it exits the Tier 1 population entirely and is
contained by the IOMMU hard boundary instead.
However, ISLE cannot assume vendor adoption. Rewriting device firmware to implement ISLE message passing requires vendor investment, ecosystem tooling, and standardization effort that will take years to mature. For the foreseeable future, most devices will continue to use traditional Tier 1 and Tier 2 drivers, and the domain budget strategies above (grouping, Tier 2 demotion, domain virtualization) are the primary long-term solution — not a temporary workaround. Domain virtualization (strategy 3) and PKS (§6, future work) remain genuinely important during this extended transition window and must be implemented correctly. They cannot be dismissed as "probably never needed."
The peer kernel model is the correct direction — it reduces the Tier 1 population, eliminates device-specific Ring 0 code, and strengthens the isolation boundary — but ISLE must operate correctly and efficiently with today's hardware for years before that future materializes. Domain grouping and automatic Tier 2 demotion are therefore the primary and durable strategies. The ecosystem shift toward peer kernels is a beneficial long-term trend that will progressively ease the domain budget, not a solution that ISLE can depend on today.
Tier 2: User-Space Drivers (Process-Isolated)
Non-performance-critical drivers run as user-space processes with full address space isolation. Communication with ISLE Core uses:
- Shared-memory ring buffers (mapped into both address spaces)
- Lightweight notification via eventfd-like mechanism
- IOMMU-restricted DMA (driver can only DMA to its allocated regions)
Tier 2 MMIO access model. Tier 2 drivers access device MMIO registers via
isle_driver_mmio_map (Section 6, KABI syscall table), which maps a device BAR region
into the driver process's address space. This mapping is direct -- the driver reads
and writes device registers without kernel mediation on each access, avoiding per-access
syscall overhead. However, the mapping is kernel-controlled and revocable:
- Setup-time validation. The kernel validates every `isle_driver_mmio_map` request: the BAR index must belong to the driver's assigned device, the offset and size must fall within the BAR's bounds, and the driver must hold the appropriate device capability. The kernel never maps BARs belonging to other devices or kernel-reserved MMIO regions.
- IOMMU containment. Even though the driver can program device registers via MMIO (including registers that initiate DMA), all DMA transactions from the device pass through the IOMMU. The device's IOMMU domain restricts DMA to regions explicitly allocated by the kernel on behalf of the driver (`isle_driver_dma_alloc`). A compromised Tier 2 driver that programs arbitrary DMA addresses into device registers will trigger IOMMU faults -- the DMA is blocked by hardware, not by software trust. This is the same IOMMU fencing applied to Tier 1 drivers, and it is the primary defense against DMA-based attacks from any driver tier.
- MMIO revocation on containment. When the kernel needs to contain a Tier 2 driver (crash, fault, admin action, or auto-demotion), it unmaps all MMIO regions from the driver process's address space as part of the containment sequence. This is a standard virtual memory operation (page table entry removal + TLB invalidation) that completes in microseconds. After MMIO revocation, any subsequent MMIO access by the driver process triggers a page fault and process termination -- the driver cannot issue further device commands. Combined with IOMMU fencing (which blocks DMA initiated before revocation from reaching non-driver memory), MMIO revocation provides a complete device access cutoff without requiring Function Level Reset.
PCIe peer-to-peer DMA and IOMMU group policy -- The "complete device access cutoff" guarantee above depends on all DMA traffic passing through the IOMMU. This holds when the device is in its own IOMMU group (ACS enabled on all upstream PCIe switches). However, devices behind a non-ACS PCIe switch can perform peer-to-peer DMA that bypasses the IOMMU entirely — a contained device could still DMA to a peer device's memory regions without IOMMU interception. ISLE addresses this by enforcing an IOMMU group co-isolation policy: when devices share an IOMMU group (no ACS), ISLE places all devices in that group under the same Tier 2 driver process (or co-isolates them in the same Tier 1 domain). IOMMU revocation during containment therefore affects the entire group atomically — there is no "partially contained" state where one device in the group is fenced but a peer is not. See Section 7.3.8 (IOMMU Groups) for the full ACS detection and group assignment policy.
Synchronous vs. asynchronous revocation -- For deliberate containment actions (admin-initiated revocation, auto-demotion, fault-triggered isolation), MMIO revocation is synchronous: the kernel performs the TLB shootdown and waits for acknowledgment from all CPUs before the containment call returns. This guarantees that no MMIO access from the driver process is possible after the containment operation completes. For the crash case (driver process dies due to SIGSEGV/SIGABRT), the dying process's threads are killed first, so the TLB shootdown is a cleanup operation -- the driver threads are no longer executing, making the timing of the shootdown a correctness concern only for the page allocator (which must not reuse the MMIO-mapped pages until the shootdown completes).
- FLR-free recovery (optimistic path). In the normal case, Tier 2 recovery does not require Function Level Reset. Tier 1 recovery requires FLR because the driver runs in Ring 0 and may have left the device in an arbitrary hardware state that only a full reset can clear. Tier 2 recovery can typically avoid FLR because: (a) IOMMU containment prevents DMA escapes regardless of device state, (b) MMIO revocation prevents further device manipulation, and (c) the device's hardware state can be re-initialized by the replacement driver instance during its `init()` call. However, devices with complex internal state machines (GPUs, SmartNICs, FPGAs) may not be safely re-initializable without a full reset. If the replacement driver's `init()` detects an unresponsive or inconsistent device (no response to MMIO reads, unexpected register state, completion timeout), the registry escalates to FLR. This fallback is not the common case for simple devices (NICs, HID, storage controllers), but should be expected for complex devices with substantial internal firmware state.
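The escalation decision reduces to a small predicate over what the replacement driver's `init()` probe observed. A minimal sketch, with hypothetical type names (`DeviceProbe`, `Recovery`) standing in for whatever the registry actually uses:

```rust
#[derive(Debug, PartialEq)]
pub enum Recovery {
    ReinitInPlace,  // optimistic FLR-free path
    EscalateToFlr,  // device stuck or inconsistent: full reset required
}

/// What the replacement driver's init() probe reports about the device.
pub struct DeviceProbe {
    pub mmio_responds: bool,      // device answers MMIO reads
    pub registers_expected: bool, // register state is consistent
    pub completed_in_time: bool,  // no completion timeout during probe
}

/// Tier 2 recovery is FLR-free unless any probe signal indicates trouble.
pub fn plan_tier2_recovery(p: &DeviceProbe) -> Recovery {
    if p.mmio_responds && p.registers_expected && p.completed_in_time {
        Recovery::ReinitInPlace
    } else {
        Recovery::EscalateToFlr
    }
}
```

For simple devices (NICs, HID, storage controllers) the optimistic branch is the norm; GPUs, SmartNICs, and FPGAs should be expected to take the FLR branch.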
Tier Mobility and Auto-Demotion
Key principle: ISLE's isolation model is designed for flexibility, not dogma. Different hardware has different isolation capabilities (see Section 3 in README.md for the full architecture-specific analysis). The tier system allows administrators to make explicit tradeoffs between isolation and performance:
- Tier 1 provides fast isolation on hardware with MPK/POE/DACR/segments (~1-4% overhead), but falls back to page-table isolation on hardware without these features (~6-20% overhead). On RISC-V, Tier 1 offers only weak isolation.
- Tier 2 provides strong process-level isolation on all architectures, at the cost of higher latency (~200-600 cycles per domain crossing vs ~23-80 cycles for Tier 1).
- The escape hatch is always available: Any Tier 1 driver can be manually demoted to Tier 2 by the administrator, or automatically demoted after repeated crashes. This allows environments that prioritize security over performance to opt into stronger isolation regardless of hardware capabilities.
Design intent: The system does not force a one-size-fits-all choice. A high-frequency trading system on x86_64 might run all drivers in Tier 1 for maximum performance. A secure enclave handling sensitive data on a RISC-V system might run all drivers in Tier 2 for maximum isolation. Both are valid deployments of the same kernel.
Drivers declare a preferred tier and a minimum tier in their manifest:
# drivers/tier1/nvme/manifest.toml
[driver]
name = "isle-nvme"
preferred_tier = 1
minimum_tier = 1 # NVMe cannot function well in Tier 2
# drivers/tier2/usb-hid/manifest.toml
[driver]
name = "isle-usb-hid"
preferred_tier = 2
minimum_tier = 2
The kernel's policy engine decides the actual tier based on:
- Trust level: Unsigned drivers are forced to Tier 2.
- Crash history: After 3 crashes within a configurable window, a Tier 1 driver is automatically demoted to Tier 2 (if minimum_tier allows).
- Admin overrides: System administrator can force any tier via configuration.
- Signature verification: Cryptographically signed drivers can be granted Tier 1.
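The policy engine's inputs compose into a short decision function. This sketch is a hedged simplification: the function name and the crash-count threshold handling are illustrative, and it reads `minimum_tier` per the manifest examples above (`minimum_tier = 1` means the driver cannot be pushed into Tier 2).

```rust
/// Decide a driver's actual tier from the policy inputs listed above.
/// Returns the tier number (1 or 2 in this sketch; Tier 0 is static-linked).
pub fn assign_tier(
    signed: bool,           // cryptographic signature verified
    crashes_in_window: u32, // crash history within the configurable window
    preferred_tier: u8,     // from the driver manifest
    minimum_tier: u8,       // most-isolated tier the driver tolerates
    admin_override: Option<u8>,
) -> u8 {
    if let Some(t) = admin_override {
        return t; // admin can force any tier via configuration
    }
    if !signed {
        return 2; // unsigned drivers are forced to Tier 2
    }
    if crashes_in_window >= 3 && minimum_tier >= 2 {
        return 2; // auto-demotion after 3 crashes, if the manifest allows Tier 2
    }
    preferred_tier // signed and stable: preferred tier is granted
}
```

Note the NVMe case: even after repeated crashes, `minimum_tier = 1` blocks demotion, so containment for such a driver relies on module reload rather than tier change.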
Debugging Across Isolation Domains (ptrace)
ptrace(PTRACE_PEEKDATA) on a Tier 1 driver thread must read memory tagged with the
driver's PKEY, which the debugger process does not have access to. The kernel handles
this by performing the read on behalf of the debugger:
ptrace access flow for MPK-isolated memory (high-level overview):
1. Debugger calls ptrace(PTRACE_PEEKDATA/POKEDATA, target_tid, addr).
2. Kernel checks: does `addr` belong to a MPK-protected region?
3. If yes: kernel performs a TOCTOU-safe PKRU manipulation
(see Security Note below) to grant temporary access,
performs the copy, then restores PKRU. This happens in kernel mode,
so the debugger process never gains direct access.
4. If no: standard ptrace read/write path (no MPK involvement).
ptrace write flow:
Same as read, but with write permission instead of read.
PKRU manipulation is a single WRPKRU instruction (~23 cycles per Section 3).
PTRACE_ATTACH to a Tier 1 driver thread:
Requires CAP_SYS_PTRACE (same as Linux).
The debugger can single-step, set breakpoints, and inspect registers.
Memory access goes through the kernel-mediated PKRU path above.
#### Security Note: TOCTOU Mitigation
The ptrace PKRU manipulation flow has a Time-Of-Check-Time-Of-Use (TOCTOU) concern:
the kernel checks access, changes PKRU, performs the copy, then restores PKRU.
Between the PKRU change and restore, if the traced driver could execute arbitrary code,
it could issue its own `WRPKRU` and escape isolation.
**Mitigation strategy:**
ptrace PKRU-protected access (TOCTOU-safe):
1. Acquire pt_reg_lock(target_tid) — traced thread cannot run.
2. Verify debugger holds CAP_SYS_PTRACE and ptrace relationship is authorized.
   This check happens before any PKRU state change.
3. Verify address belongs to a valid MPK region owned by target.
4. With IRQs disabled and pt_reg_lock held:
   a) Save current PKRU
   b) Set PKRU to grant temporary access to target's PKEY
   c) Perform the copy (read or write)
   d) Restore saved PKRU
5. Release pt_reg_lock(target_tid)
This approach creates a **locked validation window**: the traced process cannot execute
between authorization and data copy, and cannot escape by issuing its own `WRPKRU`
because it is blocked by `pt_reg_lock`. The authorization check occurs before any
PKRU manipulation, ensuring that unauthorized debuggers cannot exploit the window.
**Alternative approaches considered:**
1. **Permanently grant debugger PKRU access**: Rejected — violates isolation principle.
2. **Copy through a bounce buffer with kernel mapping**: Adds overhead but would work;
however, PKRU manipulation is fast (~23 cycles) and the lock-based approach is
simpler when the debugger is already ptrace-attached.
3. **Disable PTRACE_PEEKDATA on Tier 1 drivers**: Would compromise debuggability;
the lock-based approach provides security without removing functionality.
The key invariant is: *no user-space code from the traced process runs between PKRU
authorization and PKRU restoration*. `pt_reg_lock` enforces this invariant.
Signal Delivery Across Isolation Boundaries
When a signal targets a thread running in a Tier 1 (domain-isolated) driver:
Signal delivery to Tier 1 driver thread:
SIGKILL / SIGSTOP (non-catchable):
Kernel handles these directly — no signal frame is pushed.
For SIGKILL: the driver thread is terminated. The kernel runs
the driver's cleanup handler (if registered via KABI) in a
bounded context (timeout: 100ms). If cleanup doesn't complete,
the driver's isolation domain is revoked and all its memory freed.
Catchable signals (SIGSEGV, SIGUSR1, etc.):
1. Kernel saves driver's PKRU state.
2. Kernel sets PKRU to the process's default domain (no driver
memory access) before pushing the signal frame to the user stack.
3. Signal handler runs in the process's normal domain — it cannot
access driver-private memory.
4. On sigreturn: kernel restores the saved PKRU and resumes the
driver code with its original domain permissions.
This ensures a signal handler in application code cannot accidentally
(or maliciously) access driver-private memory while handling a signal
that interrupted driver execution.
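The two delivery paths above can be summarized as a dispatch function. This is an illustrative model: the enum names and the `DEFAULT_DOMAIN_PKRU` value are hypothetical (a PKRU value granting access only to the process's default key), while the 100ms cleanup bound and the save/restore-on-sigreturn behavior come from this section.

```rust
#[derive(Debug, PartialEq)]
pub enum Delivery {
    /// SIGKILL/SIGSTOP: no signal frame; driver cleanup is time-bounded.
    KernelHandled { cleanup_timeout_ms: u64 },
    /// Catchable: frame pushed with PKRU reset to the process default domain;
    /// saved_pkru is restored on sigreturn.
    PushFrame { saved_pkru: u32, frame_pkru: u32 },
}

/// Hypothetical: all keys denied except the process's default domain.
pub const DEFAULT_DOMAIN_PKRU: u32 = 0xFFFF_FFFC;

pub fn deliver_to_tier1(signo: i32, current_pkru: u32) -> Delivery {
    match signo {
        // SIGKILL (9) and SIGSTOP (19): handled directly by the kernel.
        9 | 19 => Delivery::KernelHandled { cleanup_timeout_ms: 100 },
        // Everything catchable: handler runs without driver-domain access.
        _ => Delivery::PushFrame {
            saved_pkru: current_pkru,
            frame_pkru: DEFAULT_DOMAIN_PKRU,
        },
    }
}
```

The invariant the dispatch encodes: a handler frame never carries the driver's domain permissions, and the driver's PKRU survives the round trip through the handler.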
See also: Section 48 (SmartNIC and DPU Integration) adds an offload tier where driver data-plane operations are proxied to a DPU over PCIe or shared memory, using the same tier classification and IOMMU fencing model.
eBPF Interaction with Driver Isolation Domains
eBPF programs are a cross-cutting kernel extensibility mechanism used for tracing (kprobe, tracepoint), networking (XDP, tc), security (LSM, seccomp), and scheduling (struct_ops). Because eBPF programs execute in kernel mode with access to kernel data structures, their interaction with driver isolation domains requires explicit specification to prevent isolation domain circumvention.
Threat model: An eBPF program, if not properly constrained, could:
1. Access Tier 1/Tier 2 driver memory directly without going through the isolation boundary
2. Bypass MPK/POE protections by running in the same domain as isle-core
3. Modify driver state without proper capability checks
4. Exfiltrate data from isolated driver memory to user space via BPF maps
Isolation architecture: eBPF programs do not run in the same isolation domain as isle-core (PKEY 0). Each loaded eBPF program is assigned to a dedicated BPF isolation domain that is distinct from:
- isle-core (PKEY 0)
- All Tier 1 driver domains (PKEY 2-13 on x86-64)
- The shared DMA domain (PKEY 14)
- The guard domain (PKEY 15)
This means eBPF programs cannot directly access driver-private memory, isle-core internal state, or any isolation domain's memory without explicit kernel mediation.
Access rules for eBPF programs:
1. No direct driver memory access: An eBPF program attached to a kprobe or tracepoint within a Tier 1 driver's code path executes in its own BPF domain, not the driver's domain. The BPF program cannot read or write the driver's private heap, stack, or MMIO-mapped device registers. Any access to driver state must go through BPF helper functions that perform cross-domain access on the program's behalf.
2. BPF helper mediation: All BPF helpers that access kernel or driver state (e.g., bpf_probe_read_kernel(), bpf_sk_lookup(), bpf_ct_lookup()) are implemented as kernel-mediated cross-domain operations. The helper:
   - Validates that the target memory region belongs to a domain for which the BPF program's domain holds the appropriate capability (see rule 4)
   - Copies data between the target domain and the BPF program's stack or map memory using kernel-internal mappings that bypass domain restrictions
   - Returns an error if the capability check fails or the access is out of bounds
3. Map isolation: BPF maps created by an eBPF program are owned by that program's BPF domain. Other isolation domains (including drivers) cannot access these maps without an explicit capability grant. Cross-domain map sharing follows the standard capability delegation mechanism (Section 11.1): the BPF domain must grant MAP_READ and/or MAP_WRITE capabilities to the target domain. This prevents a compromised driver from exfiltrating data through BPF maps it does not own.
4. Capability requirements for driver access: BPF helpers that query or modify driver state require the BPF domain to hold the appropriate capability:
   - bpf_skb_adjust_room() (modify packet buffer in NIC driver): requires CAP_NET_RAW in the caller's network namespace
   - bpf_xdp_adjust_head() / bpf_xdp_adjust_tail(): requires CAP_NET_RAW
   - Helpers that read driver statistics or state: require CAP_SYS_ADMIN or a subsystem-specific read capability
   The verifier rejects at load time any program that calls a helper for which the loading context (the process calling bpf()) does not hold the required capabilities. The eBPF runtime re-checks capabilities at helper invocation time to handle capability revocation after program load.
5. XDP and driver datapath: XDP programs attached to a NIC driver's receive path do not execute in the NIC driver's isolation domain. Instead:
   - The driver's receive handler (running in the driver's domain) copies the packet descriptor into a shared bounce buffer accessible to the BPF domain
   - The XDP program runs in the BPF domain, reading from and writing to the bounce buffer
   - Return values (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) are communicated back to the driver via a shared-memory return code
   - If the XDP program modifies the packet (XDP_TX or XDP_REDIRECT with modified data), the driver copies the modified packet back to its own domain before transmission or redirect
   This bounce-buffer design ensures the XDP program never directly accesses driver-private state (DMA rings, completion queues, device registers).
6. TC (traffic control) BPF: Same model as XDP — TC programs execute in a BPF domain, not in the network driver's or isle-net's domain. Packet data is copied through a shared buffer; the program cannot access isle-net's socket buffers, routing tables, or connection tracking state except through verified BPF helpers (bpf_fib_lookup(), bpf_ct_lookup(), etc.) that perform capability-checked cross-domain access.
7. Kprobe and tracepoint attachment to drivers: When a BPF program is attached to a kprobe within a Tier 1 driver's code:
   - The kprobe fires while the CPU is running in the driver's isolation domain
   - The BPF program is invoked after the kernel switches to the BPF domain
   - The program receives only the function arguments (copied to BPF stack) and cannot access the driver's heap, globals, or MMIO regions
   - Return probes (kretprobe) receive the return value copied to BPF stack
   The domain switch before BPF execution and the argument copy are performed by the kprobe infrastructure in isle-core, ensuring the BPF program is fully contained within its own domain.
8. LSM BPF and security hooks: LSM BPF programs attached to security hooks (file open, socket create, etc.) run in a BPF domain. They cannot access the credentials, file descriptors, or socket state of the process that triggered the hook except through BPF helpers (bpf_get_current_pid_tgid(), bpf_get_current_cred(), etc.) that copy the relevant data into the BPF program's memory. Security decisions (allow/deny) are returned via an integer return code; the program cannot directly modify kernel security state.
Domain allocation for BPF: On x86-64, BPF domains are allocated from the same PKEY pool as Tier 1 drivers (PKEY 2-13). Typical systems run 5-8 Tier 1 driver domains, leaving 4-7 domains for BPF programs. When domain exhaustion occurs (drivers + BPF programs > 12 domains), BPF programs share a common BPF domain rather than each getting a dedicated domain. This reduces isolation granularity between BPF programs but preserves isolation between BPF and drivers and between BPF and isle-core. BPF-to-BPF isolation is a best-effort optimization, not a security guarantee — BPF programs are verified code with bounded execution, and their primary isolation boundary is BPF-to-driver and BPF-to-core, both of which are always maintained regardless of domain pressure. On architectures without a fixed domain limit (PPC64LE, RISC-V with page-table fallback), each BPF program gets its own domain.
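The exhaustion policy above can be sketched as a small allocator. This is an illustrative model (the real allocator lives in isle-core and is not shown here): Tier 1 drivers require a dedicated PKEY and fail on exhaustion, while BPF allocations degrade to an existing BPF domain so that BPF-to-driver and BPF-to-core isolation always hold.

```rust
const PKEY_MIN: u8 = 2;
const PKEY_MAX: u8 = 13; // PKEYs 0-1 and 14-15 are reserved (core, DMA, guard)

#[derive(Default)]
struct DomainAllocator {
    driver_keys: Vec<u8>, // dedicated Tier 1 driver domains
    bpf_keys: Vec<u8>,    // dedicated BPF domains
}

impl DomainAllocator {
    fn free_key(&self) -> Option<u8> {
        (PKEY_MIN..=PKEY_MAX)
            .find(|k| !self.driver_keys.contains(k) && !self.bpf_keys.contains(k))
    }

    /// Tier 1 drivers need a dedicated key; allocation fails on exhaustion.
    fn alloc_driver_domain(&mut self) -> Option<u8> {
        let k = self.free_key()?;
        self.driver_keys.push(k);
        Some(k)
    }

    /// BPF programs get a dedicated key if one is free; otherwise they
    /// collapse onto an existing shared BPF domain (best-effort
    /// BPF-to-BPF isolation, guaranteed BPF-to-driver isolation).
    fn alloc_bpf_domain(&mut self) -> Option<u8> {
        if let Some(k) = self.free_key() {
            self.bpf_keys.push(k);
            return Some(k);
        }
        self.bpf_keys.first().copied() // shared fallback domain
    }
}

fn main() {
    let mut alloc = DomainAllocator::default();
    for _ in 0..8 { alloc.alloc_driver_domain().unwrap(); } // typical Tier 1 load
    assert_eq!(alloc.alloc_bpf_domain(), Some(10)); // dedicated BPF domain
}
```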
Crash handling: A crash (verifier bug, JIT bug, or helper bug) within a
BPF program triggers the same containment as a Tier 1 driver crash:
- The BPF domain is revoked
- All maps owned by that domain are invalidated (subsequent lookups return -ENOENT)
- Attached hooks are automatically detached
- The program is marked as faulted and cannot be re-attached without reload
Unlike Tier 1 drivers, BPF programs do not have a recovery path — they are considered stateless (persistent state lives in maps, which survive program reload). The administrator must reload the program manually or via orchestration.
Full specification: The complete BPF isolation model — domain confinement, map access control, capability-gated helpers, cross-domain packet redirect rules, and verifier enforcement — is specified in Section 35.2 (Packet Filtering, BPF-Based). Although Section 35.2 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. The rules above are a driver-centric summary; Section 35.2 provides the canonical specification.
Tier 2 Interface and SDK
Tier 2 drivers run in separate user-space processes. They communicate with isle-core via dedicated KABI syscalls — not the domain ring buffers used by Tier 1.
KABI syscalls for Tier 2 drivers:
These syscalls use a dedicated syscall range (__NR_isle_driver_base + offset,
allocated from the ISLE-private syscall range defined in Section 20.2). They are
not Linux-compatible syscalls -- they are ISLE-specific and used only by the
Tier 2 driver SDK. The SDK wraps them behind the same KernelServicesVTable
interface that Tier 1 drivers use, so driver code is tier-agnostic.
| KABI Syscall | Syscall Offset | Arguments | Return | Purpose |
|---|---|---|---|---|
| isle_driver_register | 0 | manifest: *const DriverManifest, manifest_size: u64, out_services: *mut KernelServicesVTable, out_device: *mut DeviceDescriptor | IoResultCode | Register with device registry. Kernel validates manifest, assigns capabilities, returns kernel services vtable and device descriptor. Called once at driver process startup. |
| isle_driver_mmio_map | 1 | device_handle: DeviceHandle, bar_index: u32, offset: u64, size: u64, out_vaddr: *mut u64 | IoResultCode | Map a device BAR (or portion) into driver address space. Kernel validates BAR ownership, IOMMU group, and capability before creating the mapping. The mapping is revocable: the kernel can unmap it at any time during driver containment (see "Tier 2 MMIO access model" above). |
| isle_driver_dma_alloc | 2 | size: u64, align: u64, flags: AllocFlags, out_vaddr: *mut u64, out_dma_addr: *mut u64 | IoResultCode | Allocate DMA-capable memory. Kernel allocates physical pages, creates IOMMU mapping, maps into driver process. Returns both virtual and DMA (bus) addresses. |
| isle_driver_dma_free | 3 | vaddr: u64, size: u64 | IoResultCode | Release a DMA buffer. Kernel tears down IOMMU mapping, unmaps from process, frees physical pages. |
| isle_driver_irq_wait | 4 | irq_handle: u32, timeout_ns: u64 | IoResultCode | Block until the registered interrupt fires or timeout expires. Returns IO_SUCCESS on interrupt, IO_TIMEOUT on timeout. Uses eventfd internally for efficient wakeup. |
| isle_driver_complete | 5 | request_id: u64, status: IoResultCode, bytes_transferred: u64 | IoResultCode | Post an I/O completion to isle-core. The completion is forwarded to the originating io_uring CQ or waiting syscall. |
Error codes: All Tier 2 KABI syscalls return IoResultCode (defined in
isle-driver-sdk/src/abi.rs). Common errors: IO_ERR_INVALID_HANDLE (bad device
handle), IO_ERR_PERMISSION (missing capability), IO_ERR_NO_MEMORY (allocation
failed), IO_ERR_BUSY (resource in use), IO_ERR_TIMEOUT.
Performance: Per-I/O overhead floor is ~200-400ns (two syscall transitions). For high-IOPS devices (NVMe, 100GbE), this is significant — those belong in Tier 1. Tier 2 suits devices where overhead is negligible: USB, printers, audio (~1-10ms periods), experimental drivers, and third-party binaries compiled against the stable SDK.
Security boundary: A Tier 2 driver crash is an ordinary process crash. It cannot corrupt kernel memory or issue DMA outside IOMMU-fenced regions. On containment, the kernel revokes all MMIO mappings (preventing further device register access) and tears down IOMMU entries (causing any residual in-flight DMA to fault). The kernel restarts the driver process if the restart policy permits (~10ms recovery).
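The syscalls in the table above compose into a simple service loop. The sketch below is a hypothetical Tier 2 driver skeleton: the isle_driver_* names come from the table, but their bodies here are stand-in stubs (the real ones are ISLE-private syscalls wrapped by the SDK), and the IO_TIMEOUT numeric value is illustrative — only the call sequence (register, wait for IRQ, post completion) reflects the spec.

```rust
const IO_SUCCESS: i32 = 0;
const IO_TIMEOUT: i32 = -5; // illustrative value, not from the spec

// --- stubs standing in for the SDK syscall wrappers ---
fn isle_driver_register() -> i32 { IO_SUCCESS }
fn isle_driver_irq_wait(_irq_handle: u32, _timeout_ns: u64) -> i32 { IO_SUCCESS }
fn isle_driver_complete(_request_id: u64, status: i32, _bytes: u64) -> i32 { status }

/// One iteration of the canonical Tier 2 service loop:
/// block on the interrupt, service the device, post the completion.
fn service_one_request(request_id: u64) -> i32 {
    match isle_driver_irq_wait(0, 1_000_000_000) {
        IO_SUCCESS => isle_driver_complete(request_id, IO_SUCCESS, 4096),
        IO_TIMEOUT => IO_TIMEOUT, // no work this period; caller may retry
        err => err,               // device error: propagate to isle-core
    }
}

fn main() {
    // Register once at process startup, then serve requests.
    assert_eq!(isle_driver_register(), IO_SUCCESS);
    assert_eq!(service_one_request(1), IO_SUCCESS);
}
```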
7. Device Registry and Bus Management
Summary: This section specifies the kernel-internal device registry — a topology-aware tree that tracks all hardware devices, their parent/child relationships, driver bindings, power states, and capabilities. It covers: bus enumeration and matching (§7.4), device lifecycle and hot-plug (§7.6-§7.7), power management ordering (§7.6), crash recovery integration (§7.10), sysfs compatibility (§7.12), and firmware management (§7.15). The registry is the single source of truth for "what hardware exists" and is used by the scheduler (§14), fault manager (§39), DPU offload layer (§48), and unified compute topology (§54). Readers needing only the API surface can skip to §7.3 (data model) and §7.9 (KABI integration).
7.1 Motivation and Prior Art
7.1.1 The Problem
ISLE's KABI provides a clean bilateral vtable exchange between kernel and driver. But the current design has no answer for:
- Device hierarchies: How does the kernel model that a USB keyboard is behind a hub, which is behind an XHCI controller, which sits on a PCI bus? The topology matters for power management ordering, hot-plug teardown, and fault propagation.
- Driver-to-device matching: When the kernel discovers a PCI device with vendor 0x8086 and device 0x2723, how does it know which driver to load? Currently there is no matching mechanism.
- Power management ordering: Suspending a PCI bridge before its child devices causes data loss. The kernel needs to know the topology to get the ordering right.
- Cross-driver services: A NIC may need a PHY driver. A GPU display pipeline may need an I2C controller. There is no way for drivers to discover and use services provided by other drivers.
- Hot-plug: When a USB device is yanked, the kernel must tear down the device, its driver, and all child devices in the correct order.
The key insight from macOS IOKit: the kernel should own the device relationship model. But IOKit's mistake was embedding the model in the driver's C++ class hierarchy, coupling it to the ABI. We build it as a kernel-internal service that drivers access through KABI methods.
7.1.2 What We Learn From Existing Systems
Linux (kobject / bus / device / driver / sysfs):
- Device model is a graph of kobject structures exposed via sysfs.
- Bus types (PCI, USB, platform) each implement their own match/probe/remove.
- Strengths: sysfs gives userspace introspection; uevent mechanism for hotplug.
- Weaknesses: driver matching is bus-specific with no unified property system; power
management ordering is heuristic (dpm_list), not topology-derived; the kobject model
is deeply entangled with kernel internals — drivers directly embed and manipulate
kobjects.
macOS IOKit (IORegistry):
- All devices modeled as a tree of C++ objects (IORegistryEntry → IOService → ...).
- Matching uses property dictionaries ("matching dictionaries").
- Power management tree mirrors the registry tree — IOPMPowerState arrays per driver.
- Strengths: property-based matching is elegant; PM ordering derives from the tree; service publication/lookup via IOService matching.
- Weaknesses: C++ class hierarchy is the ABI — changing a base class breaks all drivers (fragile base class problem). This is why Apple deprecated kexts and moved to DriverKit. The matching system is over-general (personality dictionaries are complex). Memory management is manual.
Windows PnP Manager:
- Kernel-mode PnP manager maintains a device tree. Device nodes have properties.
- INF files declare driver matching rules (declarative, external to the binary).
- Power management uses IRP_MN_SET_POWER directed through the tree.
- Strengths: INF-based declarative matching is clean; power IRPs propagate with correct ordering; robust hotplug.
- Weaknesses: IRP-based model is complex; WDM/WDF driver model is notoriously difficult.
Fuchsia (Driver Framework v2):
- "Bind rules" — a simple declarative language — match drivers to devices.
- Driver manager runs as a userspace component. Device topology is a tree of nodes in a namespace.
- Strengths: clean separation of concerns; bind rules are simple and composable; userspace driver manager can be restarted independently.
- Weaknesses: everything going through IPC adds latency; the DFv1-to-DFv2 migration shows that evolving the framework is painful.
7.1.3 ISLE's Position
We take the best ideas from each:
| Concept | Borrowed From | Adaptation |
|---|---|---|
| Property-based matching | IOKit | Declarative match rules in driver manifest, not runtime OOP matching |
| Registry as a tree | IOKit, Linux | Kernel-internal tree, drivers get opaque handles only |
| PM ordering from topology | IOKit, Windows | Topological sort of device tree, timeouts at each level |
| Service publication/lookup | IOKit | Mediated by registry through KABI, not direct object references |
| Sysfs-compatible output | Linux | Registry is the single source of truth for /sys |
| Uevent hotplug notifications | Linux | Registry emits Linux-compatible uevents |
| Declarative bind rules | Fuchsia | Match rules embedded in driver ELF binary |
What we take from none of them: the registry is a kernel-internal data structure.
Drivers never see it directly. They interact through opaque DeviceHandle values
and KABI vtable methods. No OOP inheritance, no C++ objects, no kobject embedding, no
global symbol tables. The flat, versioned, append-only KABI philosophy is fully preserved.
7.2 Design Principles
1. Kernel owns the graph, drivers own the hardware logic. The registry manages topology, matching, lifecycle, and power ordering. Drivers manage hardware registers, DMA, and device-specific protocols. Clean separation.
2. Drivers are leaves, not framework participants. A driver does not subclass a framework object. It fills in a vtable and receives callbacks. The registry decides when to call those callbacks based on topology and policy.
3. No ABI coupling. The registry is kernel-internal. Drivers interact with it through KABI methods appended to KernelServicesVTable. If the registry's internal data structures change, no driver recompilation is needed.
4. Topology drives policy. Power management ordering, hot-plug teardown, crash recovery cascading, and NUMA affinity are all derived from the device tree topology. No heuristics, no manually maintained ordering lists.
5. Capability-mediated access. All cross-driver interactions go through the registry, which validates capabilities and handles tier transitions (isolation domain switches, user-kernel IPC). Drivers never communicate directly.
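"Topology drives policy" is concrete enough to sketch: suspend order falls out of a post-order traversal of the device tree, so a child is always suspended before its parent bridge. This is an illustrative sketch (node IDs and the adjacency representation are assumptions, not registry types); resume order is simply the reverse.

```rust
use std::collections::HashMap;

/// Derive suspend order from the tree: post-order traversal guarantees
/// every child is suspended before its parent (a bridge is never powered
/// down while a device behind it is still active).
fn suspend_order(root: u64, children: &HashMap<u64, Vec<u64>>) -> Vec<u64> {
    fn visit(n: u64, children: &HashMap<u64, Vec<u64>>, out: &mut Vec<u64>) {
        for &c in children.get(&n).map(|v| v.as_slice()).unwrap_or(&[]) {
            visit(c, children, out);
        }
        out.push(n); // a node is emitted only after all its children
    }
    let mut order = Vec::new();
    visit(root, children, &mut order);
    order
}

fn main() {
    let mut children: HashMap<u64, Vec<u64>> = HashMap::new();
    children.insert(1, vec![2, 3]); // bridge 1 has children 2 and 3
    children.insert(2, vec![4]);    // device 2 has child 4
    assert_eq!(suspend_order(1, &children), vec![4, 2, 3, 1]);
}
```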
7.3 Registry Data Model
7.3.1 DeviceNode
The fundamental unit is a DeviceNode — a kernel-internal structure that drivers never
see directly.
Heap allocation requirement:
DeviceNode and its child structures (Vec, String, HashMap in PropertyTable and DeviceRegistry) require heap allocation. The device registry is initialized at boot step 4g (Section 7.11), which is after the physical memory allocator and virtual memory subsystem are running (steps 4b-4c). Tier 0 devices (APIC, timer, serial) that are needed before heap init do not use the registry — they are registered retroactively after registry init (Section 7.11.1). No registry data structures are used during early boot before the heap is available.
// Kernel-internal — NOT part of KABI
pub struct DeviceNodeId(pub u64); // Unique, monotonically increasing, never reused
pub struct DeviceNode {
// Identity
id: DeviceNodeId,
name: ArrayString<64>, // e.g., "pci0000:00", "0000:00:1f.2", "usb1-1.3"
// Tree structure
parent: Option<DeviceNodeId>,
children: Vec<DeviceNodeId>, // Ordered by discovery time
// Service relationships (non-tree edges)
providers: Vec<ServiceLink>, // Services this node consumes
clients: Vec<ServiceLink>, // Nodes that consume services from this node
// Device identity
bus_type: BusType, // Reuses existing BusType from abi.rs
bus_identity: BusIdentity, // Bus-specific ID (PCI IDs, USB descriptors, etc.)
properties: PropertyTable, // Key-value property store
// Lifecycle
state: DeviceState,
driver_binding: Option<DriverBinding>,
// Placement
numa_node: i32, // -1 = unknown
// Power
power_state: PowerState,
runtime_pm: RuntimePmPolicy,
// Security
device_cap: CapHandle, // Capability for this device
// Resources
resources: DeviceResources, // BAR mappings, IRQs, DMA state
// IOMMU
iommu_group: Option<IommuGroupId>, // Shared IOMMU group (for passthrough)
// Reliability
/// Sliding-window failure tracker. Records timestamps of recent failures
/// in a circular buffer (capacity: 16 entries). The demotion policy checks
/// how many failures occurred within the configured window (default: 1 hour).
failure_window: FailureWindow,
last_transition_ns: u64,
// State buffer integrity
/// HMAC-SHA256 key for state buffer integrity verification.
/// Generated by isle-core on first driver load for this DeviceHandle.
/// Persists across driver crash/reload cycles; discarded only on
/// DeviceHandle removal (device unplugged or deregistered).
state_hmac_key: Option<[u8; 32]>,
}
7.3.2 PropertyTable
Properties are the lingua franca of matching and introspection. They serve the same role as IOKit's property dictionaries and Linux's sysfs attributes.
// PropertyValue variants String, Bytes, and StringArray use heap-allocated
// containers. These are only constructed after heap init (boot step 4b+).
// For pre-heap device identification, Tier 0 devices use fixed-size
// ArrayString<64> in BusIdentity (Section 7.3.3) which is stack-allocated.
pub enum PropertyValue {
U64(u64),
I64(i64),
String(String),
Bytes(Vec<u8>),
Bool(bool),
StringArray(Vec<String>),
}
/// Stored as a sorted Vec for cache-friendly iteration and binary search.
/// Device nodes rarely have more than ~30 properties.
pub struct PropertyTable {
entries: Vec<(PropertyKey, String, PropertyValue)>,
}
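The sorted-Vec layout above can be exercised with a minimal sketch. This is illustrative, not the registry's code: the key is simplified to a plain String (the real PropertyKey type is not shown in this section) and the value enum is trimmed; insertion keeps the Vec sorted so lookups stay O(log n) via binary search.

```rust
#[derive(Debug, Clone, PartialEq)]
enum PropertyValue {
    U64(u64),
    Str(String),
    Bool(bool),
}

#[derive(Default)]
struct PropertyTable {
    entries: Vec<(String, PropertyValue)>, // kept sorted by key
}

impl PropertyTable {
    fn set(&mut self, key: &str, value: PropertyValue) {
        match self.entries.binary_search_by(|(k, _)| k.as_str().cmp(key)) {
            Ok(i) => self.entries[i].1 = value,                     // overwrite
            Err(i) => self.entries.insert(i, (key.to_string(), value)), // keep sorted
        }
    }

    fn get(&self, key: &str) -> Option<&PropertyValue> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| &self.entries[i].1)
    }
}

fn main() {
    let mut t = PropertyTable::default();
    t.set("vendor-id", PropertyValue::U64(0x8086));
    t.set("bus-type", PropertyValue::Str("pci".into()));
    assert_eq!(t.get("vendor-id"), Some(&PropertyValue::U64(0x8086)));
}
```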
Standard property keys (well-known constants):
| Key | Type | Description | Set By |
|---|---|---|---|
| "bus-type" | String | "pci", "usb", "platform", "virtio" | Bus enumerator |
| "vendor-id" | U64 | PCI/USB vendor ID | Bus enumerator |
| "device-id" | U64 | PCI/USB device ID | Bus enumerator |
| "subsystem-vendor-id" | U64 | PCI subsystem vendor | Bus enumerator |
| "subsystem-device-id" | U64 | PCI subsystem device | Bus enumerator |
| "class-code" | U64 | PCI class code / USB class | Bus enumerator |
| "revision-id" | U64 | Hardware revision | Bus enumerator |
| "compatible" | StringArray | DT/ACPI compatible strings | Firmware parser |
| "device-name" | String | Human-readable name | Bus enumerator |
| "driver-name" | String | Name of bound driver | Registry |
| "driver-tier" | U64 | Current isolation tier | Registry |
| "numa-node" | I64 | NUMA node ID | Topology scanner |
| "location" | String | Physical topology path (e.g., PCI BDF) | Bus enumerator |
| "serial-number" | String | Device serial if available | Bus enumerator |
Properties set by "Bus enumerator" are populated during device discovery by whatever code enumerates the bus (PCI config space scan, USB hub status, ACPI namespace walk). Properties set by "Registry" are managed by the kernel. Drivers can set custom properties on their own device node via KABI.
7.3.3 BusIdentity
A union-like enum holding bus-specific identification. Derives from the existing
PciDeviceId in the driver SDK.
pub enum BusIdentity {
Pci {
segment: u16,
bus: u8,
device: u8,
function: u8,
id: PciDeviceId, // Existing type from abi.rs
},
Usb {
bus_num: u16,
port_path: [u8; 8], // Hub topology chain
port_depth: u8,
vendor_id: u16,
product_id: u16,
device_class: u8,
device_subclass: u8,
device_protocol: u8,
interface_class: u8,
interface_subclass: u8,
interface_protocol: u8,
},
Platform {
compatible: ArrayString<64>, // ACPI _HID or DT compatible
unit_id: u64, // ACPI _UID or DT unit address
},
VirtIo {
device_type: u32,
vendor_id: u32,
device_id: u32,
},
}
7.3.4 Service Links
Non-tree edges representing provider-client relationships between devices:
pub struct ServiceLink {
service_name: ArrayString<64>, // e.g., "phy", "i2c", "gpio", "block"
node_id: DeviceNodeId,
cap_handle: CapHandle, // Capability for mediated access
}
7.3.5 Tree Structure Example
Root
+-- acpi0 (ACPI namespace root)
| +-- pci0000:00 (PCI host bridge, segment 0, bus 0)
| | +-- 0000:00:1f.0 (ISA bridge / LPC)
| | +-- 0000:00:1f.2 (SATA controller)
| | | +-- ata0 (ATA port 0)
| | | | +-- sda (disk)
| | | +-- ata1 (ATA port 1)
| | +-- 0000:00:14.0 (USB XHCI controller)
| | | +-- usb1 (USB bus)
| | | | +-- usb1-1 (hub)
| | | | | +-- usb1-1.1 (keyboard)
| | | | | +-- usb1-1.2 (mouse)
| | +-- 0000:03:00.0 (NVMe controller)
| | | +-- nvme0n1 (NVMe namespace 1)
| | +-- 0000:04:00.0 (NIC - Intel i225)
| | | ...provider-client link: "phy" --> phy0 (not a child)
+-- platform0 (Platform device root)
+-- serial0 (Platform UART)
+-- phy0 (Platform PHY device)
Two types of edges:
- Parent-Child (structural containment): A PCI device is a child of a PCI bridge. A USB device is a child of a USB hub. This is the primary tree structure.
- Provider-Client (service dependency): Lateral edges. A NIC is a client of a PHY's "phy" service. A GPU display driver is a client of an I2C controller's "i2c" service. These edges do not form cycles (enforced by the registry).
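The acyclicity invariant on provider-client edges can be enforced with a reachability check at link time, sketched below. This is an illustrative model (the registry's actual data structures are kernel-internal and not shown): before recording that a client consumes a service from a provider, reject the edge if the provider already — directly or transitively — depends on the client.

```rust
use std::collections::HashMap;

type NodeId = u64;

/// Provider edges only: node -> providers whose services it consumes.
#[derive(Default)]
struct ServiceGraph {
    providers_of: HashMap<NodeId, Vec<NodeId>>,
}

impl ServiceGraph {
    /// True if `from` can reach `to` by following provider edges.
    /// Terminates because the graph is kept acyclic by `link()`.
    fn reaches(&self, from: NodeId, to: NodeId) -> bool {
        if from == to { return true; }
        self.providers_of
            .get(&from)
            .map_or(false, |ps| ps.iter().any(|&p| self.reaches(p, to)))
    }

    /// Record a client -> provider edge, rejecting any edge that would
    /// close a cycle.
    fn link(&mut self, client: NodeId, provider: NodeId) -> Result<(), &'static str> {
        if self.reaches(provider, client) {
            return Err("service link would create a cycle");
        }
        self.providers_of.entry(client).or_default().push(provider);
        Ok(())
    }
}

fn main() {
    let mut g = ServiceGraph::default();
    assert!(g.link(1, 2).is_ok());  // NIC (1) consumes PHY (2) "phy" service
    assert!(g.link(2, 1).is_err()); // reverse edge would form a cycle
}
```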
7.3.6 The Registry
// DeviceRegistry uses BTreeMap, HashMap, Vec, and VecDeque — all heap-allocated.
// The registry is initialized at boot step 4g (Section 7.11), after the heap
// is available. It is never accessed before heap init.
pub struct DeviceRegistry {
/// All nodes, indexed by ID.
nodes: BTreeMap<DeviceNodeId, DeviceNode>,
/// Next node ID (monotonically increasing).
next_id: AtomicU64,
/// Index: bus identity --> node ID (fast device lookup).
bus_index: HashMap<BusLookupKey, DeviceNodeId>,
/// Index: property key+value --> set of node IDs (for matching).
property_index: HashMap<PropertyKey, Vec<DeviceNodeId>>,
/// Index: driver name --> set of node IDs (for crash recovery).
driver_index: HashMap<ArrayString<64>, Vec<DeviceNodeId>>,
/// Registered match rules from all known driver manifests.
match_rules: Vec<MatchRegistration>,
/// Pending hotplug events.
hotplug_queue: VecDeque<HotplugEvent>,
/// Power management state.
power_manager: PowerManager,
}
The registry lives entirely within ISLE Core. It is never exposed as a data structure to drivers.
7.3.7 DeviceResources
Each device node tracks its allocated hardware resources. This is the kernel-internal
counterpart of what Linux spreads across struct resource, struct pci_dev fields,
and struct msi_desc lists.
/// Hardware resources allocated to a device. Kernel-internal, NOT part of KABI.
pub struct DeviceResources {
/// PCI Base Address Register mappings (up to 6 BARs per PCI function).
pub bars: [Option<BarMapping>; 6],
/// Interrupt allocations (legacy, MSI, or MSI-X vectors).
pub irqs: Vec<IrqAllocation>,
/// Number of pages currently pinned for DMA by this device.
/// Page reclaim (Section 13) checks this count before attempting to compress
/// or swap a page — DMA-pinned pages are never eligible.
pub dma_pin_count: AtomicU32,
/// Maximum DMA-pinnable pages for this device (enforced by cgroup and
/// per-device limits). 0 = unlimited.
pub dma_pin_limit: u32,
/// MMIO regions mapped for this device (non-BAR, e.g., firmware tables).
pub mmio_regions: Vec<MmioRegion>,
/// Legacy I/O port ranges (x86 only, rare in modern hardware).
pub io_ports: Vec<IoPortRange>,
/// DMA address mask — how many bits of physical address the device can
/// generate. Determines bounce buffer requirements.
pub dma_mask: u64, // e.g., 0xFFFFFFFF for 32-bit DMA
pub coherent_dma_mask: u64, // For coherent (non-streaming) DMA
}
pub struct BarMapping {
pub bar_index: u8,
pub phys_addr: u64,
pub size: u64,
pub flags: BarFlags,
/// Kernel virtual address if mapped. None = not yet mapped (lazy).
pub mapped_vaddr: Option<u64>,
}
bitflags::bitflags! {
#[repr(transparent)]
pub struct BarFlags: u32 {
const MEMORY_64 = 1 << 0; // 64-bit MMIO (vs 32-bit)
const IO_PORT = 1 << 1; // I/O port space (legacy x86)
const PREFETCHABLE = 1 << 2; // Can be mapped write-combining
}
}
pub struct IrqAllocation {
pub irq_type: IrqType,
pub vector: u32, // Global IRQ vector number
pub cpu_affinity: Option<u32>, // Preferred CPU for this interrupt
}
#[repr(u32)]
pub enum IrqType {
LegacyPin = 0, // INTx (shared, level-triggered)
Msi = 1, // Message Signaled Interrupt (single vector)
MsiX = 2, // MSI-X (independent vectors, per-queue)
}
pub struct MmioRegion {
pub phys_addr: u64,
pub size: u64,
pub cacheable: bool,
}
pub struct IoPortRange {
pub base: u16,
pub size: u16,
}
DMA pin counting is a critical safety mechanism:
- Every dma_map_*() call through KABI increments the device's dma_pin_count.
- Every dma_unmap_*() call decrements it.
- The page reclaim path (Section 13) checks whether a page's owning device has active DMA pins before attempting compression or swap-out. Pages with active DMA mappings are unconditionally skipped — moving a page while a device is DMAing to it would cause silent data corruption.
- On driver crash recovery (Section 7.10), all DMA mappings for the crashed driver are forcibly invalidated (IOMMU entries torn down), and dma_pin_count is reset to zero. This is safe because the device has been reset.
- The dma_pin_limit provides defense-in-depth: a buggy or malicious driver cannot pin all of physical memory for DMA. The limit is enforced by the kernel, not the driver.
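The pin-count bookkeeping can be sketched against the DeviceResources fields shown earlier. This is a simplified illustration (function names are assumptions; a real kernel would close the check-then-add race with compare_exchange, omitted here for brevity): map increments and enforces the limit, unmap decrements, and reclaim consults the count.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct DmaAccounting {
    dma_pin_count: AtomicU32,
    dma_pin_limit: u32, // 0 = unlimited
}

impl DmaAccounting {
    /// dma_map_*() path: charge `pages` against the per-device limit.
    /// NOTE: load + fetch_add is racy; shown this way only for clarity.
    fn try_pin(&self, pages: u32) -> Result<(), &'static str> {
        let current = self.dma_pin_count.load(Ordering::Relaxed);
        if self.dma_pin_limit != 0 && current + pages > self.dma_pin_limit {
            return Err("IO_ERR_NO_MEMORY: dma_pin_limit exceeded");
        }
        self.dma_pin_count.fetch_add(pages, Ordering::Relaxed);
        Ok(())
    }

    /// dma_unmap_*() path: release the pinned pages.
    fn unpin(&self, pages: u32) {
        self.dma_pin_count.fetch_sub(pages, Ordering::Relaxed);
    }

    /// Reclaim check: pages of a device with active pins are never eligible
    /// for compression or swap-out.
    fn reclaim_eligible(&self) -> bool {
        self.dma_pin_count.load(Ordering::Relaxed) == 0
    }
}

fn main() {
    let dev = DmaAccounting { dma_pin_count: AtomicU32::new(0), dma_pin_limit: 100 };
    assert!(dev.try_pin(64).is_ok());
    assert!(!dev.reclaim_eligible()); // active DMA: reclaim must skip
    dev.unpin(64);
    assert!(dev.reclaim_eligible());
}
```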
Resource lifecycle:
Resources are allocated during device discovery (BARs, IRQs) and driver initialization (DMA mappings, additional MMIO). On device removal or driver crash, all resources are reclaimed by the registry in reverse order: DMA mappings first (IOMMU teardown), then IRQs (free vectors), then BAR unmappings, then MMIO unmappings.
7.3.8 IOMMU Groups
IOMMU groups model hardware isolation boundaries. An IOMMU group is the smallest unit of device isolation that the hardware can enforce — all devices in a group share the same IOMMU domain (page table).
pub struct IommuGroupId(pub u32);
pub struct IommuGroup {
pub group_id: IommuGroupId,
/// Devices in this group. All share the same IOMMU domain.
pub devices: Vec<DeviceNodeId>,
/// Current IOMMU domain assignment.
pub domain: IommuDomainType,
}
pub enum IommuDomainType {
/// Kernel DMA domain — device DMA goes through kernel-managed IOMMU
/// page tables. Default for all devices.
Kernel,
/// VM passthrough domain — entire group assigned to a VM. The VM's
/// IOMMU page tables control device DMA. Used for VFIO passthrough.
VmPassthrough {
vm_id: u64,
/// Second-level page table root (EPT/NPT base).
page_table_root: u64,
},
/// Userspace DMA domain — for Tier 2 drivers that need direct DMA
/// (e.g., DPDK-style networking). IOMMU restricts DMA to the
/// driver process's permitted regions.
UserspaceDma {
owning_pid: u64,
},
}
Why IOMMU groups matter:
- VFIO passthrough: When assigning a device to a VM (GPU, NIC, NVMe controller, FPGA, etc.), the kernel must assign the entire IOMMU group. If two devices share a group (e.g., GPU and its audio function on the same PCI slot, or NIC and a co-located function), both must be assigned together. The registry validates this constraint before permitting passthrough. See Section 46.2.4 for GPU-specific passthrough details.
- ACS (Access Control Services): PCIe ACS capabilities determine group boundaries. With ACS, each PCI function can be its own group. Without ACS, all devices behind a non-ACS bridge form a single group (because they could DMA to each other without going through the IOMMU).
- Isolation guarantee: The IOMMU group is the hardware's isolation primitive. The registry enforces that no device in a passthrough group remains in the kernel domain — this would allow the VM to DMA to the kernel device's memory.
Group discovery:
During PCI enumeration (Section 7.11.2), the registry determines IOMMU groups by walking the PCI topology and checking ACS capability bits:
For each PCI device:
1. Walk upstream to the root port, checking ACS at each bridge.
2. If all bridges have ACS: device is in its own group.
3. If a bridge lacks ACS: all devices below that bridge share a group.
4. Peer-to-peer devices behind the same non-ACS switch: same group.
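The four-step walk above can be modeled with a toy topology. This is an illustrative sketch (the Bridge type and group_key function are assumptions, not registry code): walking each device's bridge chain toward the root, the closest-to-root bridge lacking ACS becomes the group key for everything below it; on an all-ACS path the device keys its own group.

```rust
use std::collections::HashMap;

struct Bridge {
    id: u32,
    has_acs: bool,
    parent: Option<u32>, // upstream bridge, None at the root port
}

/// Group key for `device` whose immediate upstream bridge is `bridge`:
/// the highest non-ACS bridge on the path to the root if any exist,
/// otherwise the device itself (its own group).
fn group_key(device: u32, mut bridge: Option<u32>, bridges: &HashMap<u32, Bridge>) -> u32 {
    let mut key = device; // all-ACS path: device is its own group
    while let Some(b) = bridge {
        let br = &bridges[&b];
        if !br.has_acs {
            key = br.id; // everything below a non-ACS bridge shares a group
        }
        bridge = br.parent;
    }
    key
}

fn main() {
    let mut bridges = HashMap::new();
    bridges.insert(100, Bridge { id: 100, has_acs: true, parent: None });
    bridges.insert(101, Bridge { id: 101, has_acs: false, parent: Some(100) });
    // Two functions behind the non-ACS switch land in the same group:
    assert_eq!(group_key(1, Some(101), &bridges), 101);
    assert_eq!(group_key(2, Some(101), &bridges), 101);
    // A device on an all-ACS path gets its own group:
    assert_eq!(group_key(3, Some(100), &bridges), 3);
}
```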
Passthrough assignment flow:
1. Admin requests device passthrough for VM (via /dev/vfio/N or isle-kvm API)
2. Registry looks up device's DeviceNode → iommu_group
3. Registry checks: all devices in group unbound or assignable?
4. If yes: unbind kernel drivers, switch group to VmPassthrough domain
5. Program IOMMU with VM's second-level page tables
6. VM's guest OS sees the device and loads its own driver
7. On VM teardown: switch back to Kernel domain, rebind kernel drivers
The registry prevents partial group assignment: if device A and device B share IOMMU
group 7, and only A is requested for passthrough, the request is rejected with
-EBUSY unless B is also unbound. This prevents a safety violation where the VM
could DMA to B's kernel-managed memory.
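The partial-assignment rejection can be sketched as a single validation pass over the group's members. `GroupMember`, `Binding`, and the error encoding are assumptions of this sketch, not ISLE's real types:

```rust
#[derive(PartialEq)]
enum Binding {
    Unbound,
    KernelDriver,
}

struct GroupMember {
    binding: Binding,
    requested_for_passthrough: bool,
}

const EBUSY: i32 = 16;

/// Reject passthrough unless every member of the IOMMU group is either
/// part of the passthrough request or already unbound. A member left
/// bound to a kernel driver would let the VM DMA into kernel memory.
fn validate_group_assignment(group: &[GroupMember]) -> Result<(), i32> {
    for m in group {
        if !m.requested_for_passthrough && m.binding != Binding::Unbound {
            return Err(-EBUSY);
        }
    }
    Ok(())
}
```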
IOMMU Implementation Complexity
IOMMU management is one of the most complex subsystems in any OS kernel, and this complexity should not be underestimated. The following areas are known to be difficult and are called out explicitly as high-effort implementation items:
Nested/two-level translation (SR-IOV + VFIO) — when a VM uses VFIO passthrough with SR-IOV virtual functions, the IOMMU must perform two-level address translation: guest virtual → guest physical (first level, programmed by the guest's IOMMU driver) then guest physical → host physical (second level, programmed by the host). Intel VT-d calls this "scalable mode with first-level and second-level page tables"; AMD-Vi calls it "guest page tables with nested paging." The two-level walk doubles TLB pressure and introduces a multiplicative page table depth (4-level × 4-level = 16 potential memory accesses per translation miss). IOTLB sizing and invalidation granularity are critical performance levers.
Performance bottlenecks — known IOMMU performance traps:
- Map/unmap storm: high-throughput I/O paths (NVMe at millions of IOPS, 100GbE line-rate) can generate millions of IOMMU map/unmap operations per second. Each map/unmap involves IOTLB invalidation. ISLE mitigates this with: (1) persistent DMA mappings for ring buffers (map once at driver init, never unmap), (2) batched invalidation (accumulate invalidations, flush once per batch), (3) per-CPU IOMMU invalidation queues to avoid contention.
- IOTLB capacity: hardware IOTLB entries are scarce (~128-512 entries on typical Intel VT-d). Under heavy I/O with many DMA mappings, IOTLB misses add ~100-500ns per translation. Large pages (2MB, 1GB) in IOMMU page tables dramatically reduce IOTLB pressure — ISLE's DMA mapping interface prefers large-page-aligned allocations when possible.
- Invalidation latency: IOTLB invalidation on Intel VT-d is not instantaneous. Drain-all invalidation can take ~1-10μs. Page-selective invalidation is faster but not supported on all hardware. ISLE checks hardware capability registers and uses the finest granularity available.
ACS (Access Control Services) — PCIe ACS is required for proper IOMMU group
isolation. Without ACS on a PCIe switch, all devices behind that switch land in the
same IOMMU group (defeating per-device isolation). Many consumer motherboards lack ACS
on the root port or PCIe switch, causing all devices to share one IOMMU group. ISLE
detects this at boot and logs a warning. The pcie_acs_override kernel parameter
(Linux compatibility) allows overriding this for testing, but with an explicit security
warning.
Errata — IOMMU hardware has errata. Intel VT-d errata include broken interrupt remapping on certain steppings, incorrect IOTLB invalidation scope, and non-compliant default domain behavior. ISLE's errata framework (Section 17.4) includes IOMMU errata alongside CPU errata — detected at boot, with workarounds applied automatically.
7.4 Device Matching
7.4.1 Match Rules
Drivers declare what hardware they support through match rules embedded in the driver
binary. Match rules are stored in a dedicated ELF section (.kabi_match) and read by the
kernel loader before init() is called.
/// A single match rule. Drivers can declare multiple rules — any match
/// triggers binding.
#[repr(C)]
pub struct MatchRule {
pub rule_size: u32, // Forward compat
pub match_type: MatchType,
pub data: MatchData, // 128-byte union, interpreted per match_type
}
#[repr(u32)]
pub enum MatchType {
PciId = 0, // Match by PCI vendor/device ID (with wildcards)
PciClass = 1, // Match by PCI class code (with mask)
UsbId = 2, // Match by USB vendor/product ID
UsbClass = 3, // Match by USB class/subclass/protocol
VirtIoType = 4, // Match by VirtIO device type
Compatible = 5, // Match by "compatible" string (DT/ACPI)
Property = 6, // Match by arbitrary property key/value
}
Example — PCI ID match:
#[repr(C)]
pub struct PciMatchData {
pub vendor_id: u16, // 0xFFFF = wildcard
pub device_id: u16, // 0xFFFF = wildcard
pub subsystem_vendor: u16, // 0xFFFF = wildcard
pub subsystem_device: u16, // 0xFFFF = wildcard
pub class_code: u32, // Class code value
pub class_mask: u32, // Bits to compare (0 = ignore class)
}
A match table header in the ELF binary:
#[repr(C)]
pub struct MatchTableHeader {
pub magic: u32, // 0x4D415443 ("MATC")
pub header_size: u32,
pub rule_count: u32,
pub rule_size: u32, // sizeof(MatchRule)
// Followed by `rule_count` MatchRule structs
}
7.4.2 Match Engine
The kernel runs a simple priority-ordered match algorithm:
For each DeviceNode in Discovered state:
1. Collect the node's properties and bus identity
2. For each registered driver (sorted by priority):
a. For each MatchRule in that driver's match table:
- Evaluate the rule against the node's properties
- If match: record (driver, node, specificity) as a candidate
3. Select the candidate with highest specificity
4. If found: begin driver loading for this node
5. If no match: node stays in Discovered state (deferred probe)
Match specificity ranking (highest first):
| Rank | Match Type | Score | Example |
|---|---|---|---|
| 1 | Exact vendor + device + subsystem | 100 | This exact card from this exact OEM |
| 2 | Exact vendor + device ID | 80 | Any board with this chip |
| 3 | Full class code match | 60 | Any NVMe controller (class 01:08:02) |
| 4 | Partial class code (masked) | 40 | Any mass storage controller (class 01:xx:xx) |
| 5 | Compatible string (position-weighted) | 20+ | DT/ACPI compatible, first entry scores higher |
| 6 | Generic property match | 10 | Fallback / catchall |
When two drivers match with equal specificity, the driver with higher match_priority
(declared in its manifest) wins. If still tied, first-registered wins.
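Steps 2-4 of the algorithm, together with the tie-breaking rules, reduce to a sort over the candidate set. A sketch with illustrative types (`Candidate` and its fields are assumptions for this sketch, not SDK types):

```rust
struct Candidate {
    driver: &'static str,
    specificity: u32,      // score from the ranking table above
    match_priority: u32,   // declared in the driver manifest
    registration_order: u64,
}

/// Pick the winning driver: highest specificity, then highest
/// match_priority, then earliest registration.
fn select_driver(mut candidates: Vec<Candidate>) -> Option<Candidate> {
    candidates.sort_by(|a, b| {
        b.specificity
            .cmp(&a.specificity)
            .then(b.match_priority.cmp(&a.match_priority))
            .then(a.registration_order.cmp(&b.registration_order))
    });
    candidates.into_iter().next()
}
```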
7.4.3 Deferred Matching
Some devices cannot be matched immediately — their driver may not yet be loaded (e.g., initramfs not yet mounted, or driver installed later by package manager).
- Devices with no match stay in `Discovered` state indefinitely.
- When a new driver is registered (loaded from initramfs, installed at runtime), all `Discovered` devices are re-evaluated against the new match rules.
- A KABI method `registry_rescan()` triggers manual re-evaluation.
This is analogous to Linux's deferred probe mechanism, but simpler because the matching is centralized rather than spread across per-bus probe functions.
7.4.4 DriverManifest Extensions
The DriverManifest (defined in isle-driver-sdk/src/capability.rs) gains match-related
fields (appended per ABI rules):
// Appended to DriverManifest
pub match_rule_count: u32, // Number of match rules in .kabi_match section
pub is_bus_driver: u32, // 1 = this driver discovers child devices
pub match_priority: u32, // Higher = preferred when specificity ties
pub _pad: u32,
7.5 Device Lifecycle
7.5.1 State Machine
The registry manages each device through a well-defined state machine. Only the kernel initiates transitions — drivers cannot set their own state.
+-> [Error] ------+----> [Quarantined]
| | |
[Discovered] -> [Matching] -> [Loading] -> [Initializing] -> [Active]
^ ^ | |
| | | v
| +--- (no match) -----------+ [Suspending]
| | |
| +-- (admin re-enable) -- [Quarantined] v
+-- (hotplug rescan) ---- [Removed] [Suspended]
| ^ |
| | v
+-- (driver reload) ----- [Stopping] <-------------- [Resuming]
^ |
| v
[Recovering] <------------- [Active]
#[repr(u32)]
pub enum DeviceState {
Discovered = 0, // Node exists, no driver bound
Matching = 1, // Match engine evaluating
Loading = 2, // Driver binary being loaded
Initializing = 3, // driver init() called, waiting for result
Active = 4, // Driver running normally
Suspending = 5, // Suspend requested, waiting for driver ack
Suspended = 6, // Driver has acknowledged suspend
Resuming = 7, // Resume requested, waiting for driver ack
Stopping = 8, // Driver being stopped (unload, removal, admin)
Recovering = 9, // Driver crashed, recovery in progress
Removed = 10, // Device physically removed (hotplug)
Error = 11, // Fatal error, non-functional
Quarantined = 12, // Driver permanently disabled (crash threshold exceeded);
// requires manual re-enable via sysfs
}
7.5.2 Transition Table
| From | To | Trigger | Driver Callback |
|---|---|---|---|
| Discovered | Matching | New device or new driver registered | None |
| Matching | Loading | Match found | None |
| Matching | Discovered | No match | None |
| Loading | Initializing | Binary loaded, vtable exchange begins | `init()` |
| Initializing | Active | `init()` returns success | None |
| Initializing | Error | `init()` returns error or timeout | None |
| Active | Suspending | PM suspend request | `suspend()` |
| Suspending | Suspended | `suspend()` returns success | None |
| Suspending | Error | `suspend()` timeout or failure | `shutdown()` (force) |
| Suspended | Resuming | PM resume request | `resume()` |
| Resuming | Active | `resume()` returns success | None |
| Resuming | Recovering | `resume()` failure | None |
| Active | Stopping | Admin request, unload, or hotplug removal | `shutdown()` |
| Active | Recovering | Fault detected (domain violation, watchdog, crash) | None |
| Recovering | Loading | Recovery initiated, fresh binary load | (fresh `init()`) |
| Error | Quarantined | Crash threshold exceeded (5+ failures in window) | None |
| Quarantined | Matching | Manual administrator re-enable via sysfs | None |
| Any | Removed | Physical device gone + teardown complete | `shutdown()` if possible |
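Encoding a handful of the rows above as an explicit allow-list shows how the registry can reject an invalid transition before touching the driver. The subset of `DeviceState` variants here is just for illustration:

```rust
#[derive(Clone, Copy)]
enum DeviceState {
    Discovered,
    Matching,
    Loading,
    Initializing,
    Active,
    Error,
}

/// Only transitions present in the table are permitted; everything else
/// is a registry bug and is rejected.
fn transition_allowed(from: DeviceState, to: DeviceState) -> bool {
    use DeviceState::*;
    matches!(
        (from, to),
        (Discovered, Matching)
            | (Matching, Loading)
            | (Matching, Discovered)
            | (Loading, Initializing)
            | (Initializing, Active)
            | (Initializing, Error)
    )
}
```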
7.5.3 Timeouts
Every callback has a timeout. If the driver does not respond within the timeout, the kernel force-stops it (same mechanism as crash recovery: revoke isolation domain / kill process).
| Callback | Tier 1 Timeout | Tier 2 Timeout |
|---|---|---|
| `init()` | 5 seconds | 10 seconds |
| `shutdown()` | 3 seconds | 5 seconds |
| `suspend()` | 2 seconds | 5 seconds |
| `resume()` | 2 seconds | 5 seconds |
All timeouts are configurable via kernel parameters.
7.6 Power Management
7.6.1 Power States
#[repr(u32)]
pub enum PowerState {
D0Active = 0, // Fully operational
D1LowPower = 1, // Low-power idle (quick resume)
D2DeepSleep = 2, // Deeper sleep (longer resume, less power)
D3Off = 3, // Powered off (full re-init on resume)
}
7.6.2 Topology-Driven Ordering
This is the primary advantage of having a kernel-owned device tree. Suspend/resume ordering is derived from topology, not maintained as a separate list.
Suspend order (depth-first, leaves first):
For each subtree rooted at device D:
1. Suspend all clients of D (provider-client links)
2. Recursively suspend all children of D (bottom-up)
3. Suspend D itself
Resume order (exact reverse):
For each subtree rooted at device D:
1. Resume D itself
2. Recursively resume all children of D (top-down)
3. Resume all clients of D
This is computed once by topological sort when a system PM transition begins. Provider-client edges are treated as additional dependency edges in the sort. The result is cached and invalidated when the tree topology changes.
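The suspend ordering above sketches as a depth-first post-order traversal, with provider-client edges visited before children. Node-id indices and the adjacency encoding are assumptions of this sketch:

```rust
/// Compute suspend order for the subtree rooted at `node`:
/// clients first, then children (leaves first), then the node itself.
/// `children[n]` and `clients[n]` are adjacency lists; `out` receives
/// node ids in the order they should be suspended. Resume order is the
/// exact reverse of `out`.
fn suspend_order(
    node: usize,
    children: &[Vec<usize>],
    clients: &[Vec<usize>],
    out: &mut Vec<usize>,
    visited: &mut Vec<bool>,
) {
    if visited[node] {
        return;
    }
    visited[node] = true;
    // 1. Clients of this provider must suspend before it.
    for &c in &clients[node] {
        suspend_order(c, children, clients, out, visited);
    }
    // 2. Then its subtree, bottom-up.
    for &c in &children[node] {
        suspend_order(c, children, clients, out, visited);
    }
    // 3. Finally the node itself.
    out.push(node);
}
```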
Why this is better than Linux: Linux maintains a dpm_list that approximates
topological order but can get it wrong. The ordering is based on registration order and
heuristic adjustments, not the actual device tree. ISLE computes the correct order
directly from the tree.
7.6.3 PM Failure Handling
When a driver fails to suspend within its timeout:
- Registry marks the node as `Error`.
- Driver is force-stopped (revoke isolation domain / kill process).
- Suspend continues for remaining devices — one broken driver does not block the entire system.
- On resume, the failed device's driver is reloaded fresh (leveraging crash recovery from Section 9).
- Failure is logged with context for admin diagnosis.
This directly implements the principle from Section 38: "Tier 1 and Tier 2 drivers that fail to suspend within a timeout are forcibly stopped and restarted on resume."
7.6.4 Runtime Power Management
Beyond system suspend, individual devices can enter low-power states when idle:
pub struct RuntimePmPolicy {
pub enabled: bool,
pub idle_timeout_ms: u32, // Enter D1 after this idle period
pub min_state: PowerState, // Deepest state allowed during runtime PM
}
The registry tracks I/O activity per device (through KABI call frequency). When a device
has been idle for idle_timeout_ms, the registry initiates a runtime suspend of that
device alone. Children are only suspended if they are also idle.
Runtime PM is independent of system PM. A device can be in D1 (runtime idle) while the system is fully running.
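The runtime-suspend decision reduces to a small predicate over the policy and the device's last-activity timestamp. This sketch trims `RuntimePmPolicy` to the fields it uses, and the timestamp parameters are assumptions:

```rust
struct RuntimePmPolicy {
    enabled: bool,
    idle_timeout_ms: u32,
}

/// True if the device should enter runtime suspend: policy enabled,
/// all children idle (per the rule above), and the device has seen no
/// I/O for at least idle_timeout_ms.
fn should_runtime_suspend(
    policy: &RuntimePmPolicy,
    now_ms: u64,
    last_io_ms: u64,
    children_idle: bool,
) -> bool {
    policy.enabled
        && children_idle
        && now_ms.saturating_sub(last_io_ms) >= policy.idle_timeout_ms as u64
}
```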
7.7 Hot-Plug
7.7.1 Bus Drivers as Event Sources
Bus drivers (PCI host bridge, USB XHCI, USB hub) are the source of hotplug events. They detect device arrival/departure and report to the registry through KABI methods.
A bus driver is identified by is_bus_driver = 1 in its DriverManifest. It has the
HOTPLUG_NOTIFY capability (already defined in capability.rs).
7.7.2 Device Arrival
1. Bus driver detects new device
(PCIe hot-add interrupt, USB port status change, ACPI _STA change)
2. Bus driver calls registry_report_device() via KABI
- Passes: parent handle, bus type, bus-specific identity, initial properties
3. Registry creates a new DeviceNode in Discovered state
4. Registry populates properties from the bus driver's report
5. Registry runs the match engine on the new node
6. If match found: load driver, init, transition to Active
7. Registry emits uevent for Linux compatibility (udev/systemd)
7.7.3 Device Removal (Orderly)
1. Bus driver detects device departure (link down, port status change)
2. Bus driver calls registry_report_removal() via KABI
3. Registry processes the subtree bottom-up:
a. For each child (deepest first):
- Stop the child's driver (shutdown callback)
- Release capabilities
- Remove child node
b. Stop the target device's driver
c. Release all capabilities
d. Remove the DeviceNode
4. Registry emits uevent (removal)
7.7.4 Surprise Removal
When a device is physically yanked without warning (e.g., USB unplug during I/O):
- Bus driver detects absence (failed transaction, link down).
- Registry receives the removal report.
- All pending I/O for the device and its children is completed with `-EIO`.
- `shutdown()` is called on the driver — it may fail quickly because the hardware is gone. This is expected and handled gracefully (timeout → force-stop).
- The node subtree is torn down.
This mirrors crash recovery but is initiated by the bus driver rather than by a fault.
7.7.5 Uevent Compatibility
For Linux userspace compatibility (udev, systemd-udevd), the registry emits uevent notifications matching the Linux format:
ACTION=add
DEVPATH=/devices/pci0000:00/0000:03:00.0
SUBSYSTEM=pci
PCI_ID=8086:2723
PCI_CLASS=028000
DRIVER=isle-iwlwifi
This feeds into isle-compat/src/sys/ for sysfs and isle-compat/src/dev/ for
devtmpfs, as outlined in Section 20.3.
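The uevent payload itself is plain `KEY=value` lines, one per line. A sketch of the formatter (the helper name is hypothetical; only the wire format follows the example above):

```rust
/// Build a Linux-format uevent string: an ACTION line followed by
/// KEY=value pairs, newline-terminated, matching what udev expects.
fn format_uevent(action: &str, pairs: &[(&str, &str)]) -> String {
    let mut s = format!("ACTION={}\n", action);
    for (k, v) in pairs {
        s.push_str(k);
        s.push('=');
        s.push_str(v);
        s.push('\n');
    }
    s
}
```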
7.8 Service Discovery
7.8.1 The Problem
Drivers sometimes need services from other drivers — not through direct communication, but through mediated access. Examples:
- NIC needs a PHY driver (MII bus)
- GPU display pipeline needs I2C controller for DDC/EDID
- RAID controller needs to discover member disks
- Filesystem driver needs its underlying block device
In Linux, each of these has a subsystem-specific mechanism (phylib, i2c_adapter, md_personality, etc.) with its own registration/lookup API. In IOKit, it is done through IOService matching. ISLE unifies service discovery through the registry.
7.8.2 Service Publication
A driver can publish a named service on its device node:
Driver A (e.g., PHY driver):
1. Completes init, device node is Active
2. Calls registry_publish_service("phy", &phy_vtable)
3. Registry records: node A provides service "phy" with given vtable
The phy_vtable is a service-specific C-ABI vtable (same flat, versioned approach as
all other KABI vtables). The registry stores a reference to it.
7.8.3 Service Lookup
A driver can look up a named service:
Driver B (e.g., NIC driver):
1. Needs PHY service
2. Calls registry_lookup_service("phy", scope=ParentSubtree)
3. Registry searches for a node in scope that publishes "phy"
4. Registry validates Driver B has PEER_DRIVER_IPC capability
5. Registry creates a provider-client link (B consumes A's "phy")
6. Registry returns a wrapped service vtable and a ServiceHandle
Lookup scope options:
#[repr(u32)]
pub enum ServiceLookupScope {
Siblings = 0, // Same parent only
ParentSubtree = 1, // Parent and all its descendants
Global = 2, // Entire registry (expensive, rare)
Specific = 3, // A specific node (by DeviceHandle)
}
7.8.4 Mediated Access
The registry mediates all cross-driver service access. This is critical:
- The registry validates capabilities before returning a service handle.
- The returned vtable is wrapped by the registry — calls go through a trampoline that:
- Validates the service handle is still valid
- Performs the isolation domain switch if provider and client are in different Tier 1 domains
- Handles the user-kernel transition if one side is Tier 2
- The registry can revoke a service link at any time (e.g., when the provider crashes).
- The registry tracks all active links for PM ordering (clients must suspend before providers).
- Drivers never hold direct pointers to each other's memory.
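The trampoline's validity check is the key mechanism: a revoked link returns `-ENODEV` instead of calling into freed provider memory. A minimal single-argument sketch with assumed types (the real trampoline would also perform the domain switch and tier crossing noted above):

```rust
const ENODEV: i32 = 19;

/// Registry-side state for one provider-client service link.
struct ServiceLink {
    valid: bool,
}

/// Every cross-driver service call goes through this wrapper rather
/// than a raw provider pointer.
fn trampoline_call(link: &ServiceLink, target: fn(u32) -> i32, arg: u32) -> i32 {
    if !link.valid {
        // Provider crashed or link was revoked by the registry.
        return -ENODEV;
    }
    // Real code would switch isolation domains / cross the user-kernel
    // boundary here before invoking the provider's vtable entry.
    target(arg)
}
```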
7.8.5 Service Recovery
When a provider driver crashes and is reloaded:
- The registry invalidates all service handles pointing to the crashed provider.
- Client drivers that call the service vtable receive `-ENODEV` from the trampoline.
- After the provider is reloaded and republishes its service, client drivers receive a `service_recovered` callback (optional, new addition to `DriverEntry`):
// Appended to DriverEntry (optional)
pub service_recovered: Option<unsafe extern "C" fn(
ctx: *mut c_void,
service_name: *const u8,
service_name_len: u32,
) -> InitResultCode>,
The client driver can then re-acquire the service handle and resume operations.
7.8.6 Registry Event Notifications
Beyond driver-to-driver service recovery, kernel subsystems need to react to device lifecycle events. The registry provides an internal notification mechanism (not exposed through KABI — this is kernel-to-kernel only).
/// Registry event types that kernel subsystems can subscribe to.
#[repr(u32)]
pub enum RegistryEvent {
/// A new device node was created (after bus enumeration).
DeviceDiscovered = 0,
/// A device transitioned to Active (driver bound and initialized).
DeviceActive = 1,
/// A device is being removed (before teardown begins).
DeviceRemoving = 2,
/// A device's driver crashed and recovery is starting.
DeviceRecovering = 3,
/// A device's power state changed.
PowerStateChanged = 4,
/// IOMMU group assignment changed (passthrough ↔ kernel domain).
IommuGroupChanged = 5,
/// A service was published or unpublished.
ServiceChanged = 6,
}
/// Callback type for registry event notifications.
pub type RegistryNotifyFn = fn(
event: RegistryEvent,
node_id: DeviceNodeId,
context: *mut c_void,
);
Subscribers:
| Kernel Subsystem | Events | Purpose |
|---|---|---|
| Memory manager (Section 12) | `DeviceDiscovered`, `DeviceRemoving` | Update NUMA topology when devices with local memory appear/disappear |
| Scheduler (Section 14) | `DeviceActive`, `DeviceRemoving` | Update IRQ affinity recommendations |
| FMA engine (Section 39) | `DeviceRecovering` | Log fault management events, track failure patterns |
| AccelScheduler (Section 42) | `DeviceActive`, `DeviceRecovering`, `PowerStateChanged` | Manage accelerator context lifecycle |
| Sysfs compat (Section 7.12) | All events | Update /sys filesystem in real-time |
Notifications are dispatched synchronously during registry state transitions. Subscribers must not block — they record the event and defer heavy work to a workqueue. This prevents a slow subscriber from delaying device bring-up.
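The "record the event and defer heavy work" rule can be sketched as a subscriber whose synchronous callback only enqueues; a workqueue drains the queue later. Types here are illustrative:

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy)]
enum RegistryEvent {
    DeviceActive,
    DeviceRemoving,
}

struct Subscriber {
    pending: VecDeque<(RegistryEvent, u64)>,
}

impl Subscriber {
    /// Called synchronously during a registry state transition.
    /// Must not block: just record the event.
    fn notify(&mut self, event: RegistryEvent, node_id: u64) {
        self.pending.push_back((event, node_id));
    }

    /// Called later from workqueue context; heavy work happens here.
    /// Returns the number of events processed.
    fn drain(&mut self) -> usize {
        let n = self.pending.len();
        self.pending.clear(); // real code would process each event
        n
    }
}
```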
7.9 KABI Integration
7.9.1 New Methods Appended to KernelServicesVTable
All new methods are Option<...> for backward compatibility. Older kernels that do not
have the registry will have these as None. Drivers must check for None before calling.
// === Device Registry (appended to KernelServicesVTable) ===
/// Report a newly discovered device to the registry.
/// Called by bus drivers (PCI enumeration, USB hub, etc.).
pub registry_report_device: Option<unsafe extern "C" fn(
parent_handle: DeviceHandle,
bus_type: BusType,
bus_identity: *const u8,
bus_identity_len: u32,
properties: *const PropertyEntry,
property_count: u32,
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Report that a device has been physically removed.
pub registry_report_removal: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
) -> IoResultCode>,
/// Get a property value from a device node.
pub registry_get_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
out_value: *mut PropertyValueC,
out_value_size: *mut u32,
) -> IoResultCode>,
/// Set a property on a device node.
pub registry_set_property: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
key: *const u8,
key_len: u32,
value: *const PropertyValueC,
value_size: u32,
) -> IoResultCode>,
/// Publish a named service on this device node.
pub registry_publish_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
service_vtable: *const c_void,
service_vtable_size: u64,
) -> IoResultCode>,
/// Look up a named service.
pub registry_lookup_service: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
service_name: *const u8,
service_name_len: u32,
scope: u32,
out_service_vtable: *mut *const c_void,
out_service_handle: *mut ServiceHandle,
) -> IoResultCode>,
/// Release a previously acquired service handle.
pub registry_release_service: Option<unsafe extern "C" fn(
service_handle: ServiceHandle,
) -> IoResultCode>,
/// Get the device handle for the current driver instance.
pub registry_get_device_handle: Option<unsafe extern "C" fn(
out_handle: *mut DeviceHandle,
) -> IoResultCode>,
/// Enumerate children of a device node.
pub registry_enumerate_children: Option<unsafe extern "C" fn(
device_handle: DeviceHandle,
out_handles: *mut DeviceHandle,
max_count: u32,
out_count: *mut u32,
) -> IoResultCode>,
7.9.2 New ABI Types
/// Opaque handle to a device node in the registry.
#[repr(C)]
pub struct DeviceHandle {
pub id: u64,
}
impl DeviceHandle {
pub const INVALID: Self = Self { id: 0 };
}
/// Opaque handle to an acquired service.
#[repr(C)]
pub struct ServiceHandle {
pub id: u64,
}
/// A property entry for C ABI transport.
#[repr(C)]
pub struct PropertyEntry {
pub key: *const u8,
pub key_len: u32,
pub value_type: PropertyType,
pub value_data: *const u8,
pub value_len: u32,
pub _pad: u32,
}
#[repr(u32)]
pub enum PropertyType {
U64 = 0,
I64 = 1,
String = 2,
Bytes = 3,
Bool = 4,
StringArray = 5,
}
/// C-ABI-safe property value output buffer.
#[repr(C)]
pub struct PropertyValueC {
pub value_type: PropertyType,
pub _pad: u32,
pub data: [u8; 256],
}
7.9.3 DeviceDescriptor Extension
The existing DeviceDescriptor gains new fields (appended):
// Appended to DeviceDescriptor
pub device_handle: DeviceHandle, // Registry handle for this device
pub numa_node: i32, // NUMA node (-1 = unknown)
pub _pad: u32,
The DeviceDescriptor passed to driver_entry.init() is now populated from the
registry node's properties, ensuring consistency between what the registry knows and
what the driver sees.
7.10 Crash Recovery Integration
The registry participates in the crash recovery sequence defined in Section 9.
7.10.1 When a Driver Crashes
- Detection: ISLE Core detects the fault (hardware exception in isolation domain, watchdog timeout, Tier 2 process crash).
- Registry notification: ISLE Core identifies the faulting driver's device node. Registry transitions it to `Recovering`.
- Service invalidation: All service handles pointing to the crashed driver are invalidated. Client drivers receive `-ENODEV` on subsequent service calls.
- Child cascade: If the crashed driver is a bus driver with children, the registry processes children bottom-up: for each child, stop driver, release capabilities, transition to `Stopping`. Children are re-probed after the bus driver recovers.
- I/O drain + DMA fence: All pending I/O completed with `-EIO`. Critically, before freeing any driver memory, ISLE must ensure no in-flight DMA operations can write to those pages. The sequence:
  - IOMMU mapping for the driver's DMA regions is revoked (set to fault-on-access) immediately at step 2 (ISOLATE). Any in-flight DMA that completes after this point will hit an IOMMU fault (harmless — the write is dropped by the IOMMU).
  - IOTLB invalidation with drain: The IOMMU's IOTLB is invalidated to ensure all cached translations are flushed. On Intel VT-d, this uses the Invalidation Wait Descriptor with `IWD=1` to wait for invalidation completion (NOT for in-flight DMA). On AMD, the `COMPLETION_WAIT` command provides similar functionality. Note: these mechanisms wait for the IOTLB flush, not for device DMA completion.
  - Device quiescence: After IOTLB invalidation, ISLE ensures device quiescence via one of: (a) FLR (Function Level Reset) which halts all device operations, (b) device-specific stop commands (e.g., NVMe controller disable), or (c) a configurable delay (default 100 ms) if neither is available. This is the actual guarantee that in-flight DMA has stopped or will fault.
  - Only after both IOTLB invalidation AND device quiescence is the driver's private memory freed.
- Why this matters: without the DMA fence, a device that was mid-DMA at crash time could write to memory that has been freed and reallocated to another driver or to userspace — a use-after-free via hardware. The IOMMU revocation + IOTLB flush + device quiescence eliminates this class of vulnerability.
- Device reset: FLR for PCIe, port reset for USB, etc.
- Driver reload: Fresh binary loaded, new vtable exchange. The `DeviceDescriptor` retains the same `DeviceHandle` — the device's identity in the registry is preserved across crashes.
- Service re-publication: Reloaded driver publishes its services again. Registry notifies clients via `service_recovered` callback.
- Child re-probe: If this was a bus driver, the registry re-enumerates and re-probes child devices.
7.10.2 Failure Counter Integration
The registry's per-node failure_window (a FailureWindow sliding-window counter)
feeds into the existing auto-demotion policy. The counter records timestamps in a
16-entry circular buffer; the policy query asks "how many entries fall within the
last N seconds?" (default window: 1 hour):
failure_window.count_within(1 hour):
0-2: Reload at same tier
3+: Demote to next lower tier (if minimum_tier allows)
5+: Transition to Quarantined state (driver permanently disabled, device
unbound); requires manual administrator re-enable via sysfs. Log critical alert.
This is the same policy described in Section 9, now with the registry as the tracking mechanism.
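The 16-entry circular buffer and its window query sketch as follows (plain `u64` second timestamps stand in for whatever kernel time type the real `FailureWindow` uses):

```rust
/// Sliding-window failure counter: a fixed 16-entry circular buffer of
/// failure timestamps, queried as "how many fall within the last
/// `window` seconds?"
struct FailureWindow {
    stamps: [u64; 16],
    next: usize, // index of the slot to overwrite next
    len: usize,  // number of valid entries (caps at 16)
}

impl FailureWindow {
    fn new() -> Self {
        Self { stamps: [0; 16], next: 0, len: 0 }
    }

    fn record(&mut self, now: u64) {
        self.stamps[self.next] = now;
        self.next = (self.next + 1) % 16;
        if self.len < 16 {
            self.len += 1;
        }
    }

    fn count_within(&self, now: u64, window: u64) -> usize {
        self.stamps[..self.len]
            .iter()
            .filter(|&&t| now.saturating_sub(t) <= window)
            .count()
    }
}
```

The policy above then branches on `count_within(now, 3600)`: 0-2 reloads at the same tier, 3+ demotes, 5+ quarantines.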
How auto-demotion works without recompilation — A driver that can run in both Tier 1
(isolation domain, Ring 0) and Tier 2 (process, Ring 3) does not need two separate
binaries. The KABI vtable abstraction (Section 5) provides identical function signatures
regardless of tier. The difference is in the hosting environment: Tier 1 drivers are
loaded as shared objects into a kernel isolation domain; Tier 2 drivers are loaded as
processes. The same .isle binary is valid in both contexts because KABI syscalls (ring
buffer operations, capability invocations) are designed to work from either Ring 0 or
Ring 3 — the Tier 1 path uses direct function calls via the vtable, while the Tier 2 path
uses syscall wrappers that implement the same vtable interface. Auto-demotion simply means
"restart this driver binary in a Tier 2 process instead of a Tier 1 isolation domain."
The driver code is unaware of the change; only the hosting environment differs.
7.11 Boot Sequence Integration
The registry integrates into the boot sequence (Section 17.3):
4. ISLE Core initialization:
a. Parse boot parameters and ACPI tables
b. Initialize physical memory allocator
c. Initialize virtual memory
d. Initialize per-CPU data structures
e. Initialize Tier 0 drivers: APIC, timer, early console
f. Initialize capability system
g. Initialize device registry <-- NEW
h. Register Tier 0 devices in registry <-- NEW
i. Initialize scheduler
j. Mount initramfs
5. ACPI/DT enumeration: populate registry <-- NEW
6. PCI enumeration: create device nodes <-- NEW
7. Registry runs match engine, loads storage driver <-- REPLACES ad-hoc loading
8. Mount real root filesystem
9. Continue device enumeration (USB, etc.) <-- NEW
10. Execute /sbin/init
7.11.1 Tier 0 Devices
Tier 0 drivers (APIC, timer, serial) are statically linked and initialized before the registry exists. After registry init, they are registered retroactively:
registry.register_tier0_device("apic", ...);
registry.register_tier0_device("timer", ...);
registry.register_tier0_device("serial0", ...);
These nodes are created directly in Active state with no match/load cycle.
7.11.1a Console Handoff
The display and input stack transitions through multiple phases during boot. The handoff protocol ensures zero message loss and graceful degradation.
Phase 1 — Tier 0 (early boot):
- Serial console (COM1/PL011/16550) is active from the first instruction.
- VGA text mode (80×25) initialized by BIOS/UEFI firmware on x86-64.
- All kernel output goes to the ring buffer (klog), serial, and VGA text mode
simultaneously. The ring buffer captures every message from the first printk.
Phase 2 — Tier 1 loaded (DRM/KMS driver):
- The DRM/KMS display driver initializes, performs modeset, and allocates a framebuffer.
- A framebuffer console renderer (fbcon) is initialized with the target resolution.
Handoff protocol:
1. DRM driver completes modeset, signals "console ready" via KABI callback:
driver_event(CONSOLE_READY, framebuffer_info)
2. Kernel console subsystem:
a. Locks the console output path (brief pause, <1ms)
b. Replays the full ring buffer contents onto the framebuffer console
— no boot messages are lost, the user sees the complete boot log
c. Registers fbcon as the primary console output
d. Unlocks the console output path
3. Serial console remains active — never disabled. All output goes to BOTH
serial and framebuffer. This ensures remote management always works.
4. VGA text mode driver is deregistered as the *primary* console backend.
The VGA text mode memory region (0xB8000) is NOT released to the physical
memory allocator — it is reserved as a panic-only fallback (see below).
The region is small (4000 bytes) and the cost of keeping it reserved is
negligible compared to the benefit of having a guaranteed crash output path.
Keyboard handoff:
- Early boot: PS/2 scan code handler (Tier 0) captures keystrokes into a buffer. This allows emergency interaction (e.g., boot parameter editing) before USB is up.
- Tier 1 loaded: USB HID driver initializes, registers as input device. The input subsystem drains the PS/2 keystroke buffer — no keystrokes are lost.
- PS/2 handler remains active for keyboards physically connected via PS/2.
Virtual terminals:
- VT switching (Ctrl+Alt+F1–F6) is implemented in isle-core's input multiplexer,
NOT in the display driver. The display driver is a passive renderer.
- On VT switch, the input multiplexer sends a SWITCH_VT(n) command to the
display driver via KABI. The driver switches which virtual framebuffer is scanned
out.
- This design means a crashing display driver doesn't break VT switching logic —
on driver recovery, the multiplexer re-sends the current VT state.
Crash fallback:
- If the DRM driver faults, the core reverts to VGA text mode (x86-64) or serial-only (AArch64/RISC-V/PPC) for panic output. Tier 0 console backends are always available.
- The panic handler bypasses the normal console locking path and writes directly to the Tier 0 backends (serial + VGA text if available).
7.11.2 PCI Enumeration
PCI enumeration is part of ISLE Core (Tier 0 functionality in early boot). It walks PCI configuration space and creates device nodes:
For each PCI bus (starting from bus 0):
For each device 0-31, function 0-7:
If device present:
1. Create DeviceNode with PCI bus identity
2. Populate properties: vendor-id, device-id, class-code, BARs, IRQs
3. If this is a bridge: create a bus node, recurse into secondary bus
4. Set numa_node from ACPI SRAT proximity domain
5. Registry runs match engine for this node
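The walk above can be sketched in Rust. This is a minimal illustration, not the registry's real API: `PciAddr`, `DeviceNode`, and the `read_vendor` accessor are hypothetical stand-ins, and bridge recursion plus the multifunction-header check are elided to comments.

```rust
#[derive(Debug, Clone)]
pub struct PciAddr { pub bus: u8, pub device: u8, pub function: u8 }

#[derive(Debug)]
pub struct DeviceNode { pub addr: PciAddr, pub vendor_id: u16 }

/// `read_vendor` simulates a 16-bit config-space read at offset 0x00;
/// 0xFFFF means "no device present at this function".
pub fn enumerate_bus(
    bus: u8,
    read_vendor: &dyn Fn(u8, u8, u8) -> u16,
    out: &mut Vec<DeviceNode>,
) {
    for device in 0..32 {
        for function in 0..8 {
            let vendor = read_vendor(bus, device, function);
            if vendor == 0xFFFF {
                continue; // empty slot/function
            }
            out.push(DeviceNode {
                addr: PciAddr { bus, device, function },
                vendor_id: vendor,
            });
            // A real walk would stop at function 0 unless the header's
            // multifunction bit is set, and would recurse into the secondary
            // bus when the header type indicates a PCI-to-PCI bridge.
        }
    }
}
```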
7.11.3 NUMA Awareness
ACPI SRAT (System Resource Affinity Table) provides NUMA topology. The registry uses
this to set numa_node on each device node based on the device's proximity domain (PCI
devices inherit from their root port's NUMA node).
This information is available for:
- Driver memory allocation: prefer the device's NUMA node.
- DMA buffer allocation: prefer the device's NUMA node.
- IRQ affinity: suggest CPU affinity matching the device's NUMA node.
- Tier 1 domain assignment: prefer grouping NUMA-local devices when isolation domains
  are shared.
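A minimal sketch of the "prefer the device's NUMA node" allocation policy, assuming a hypothetical allocator hook `alloc_on_node` (not a real KABI call):

```rust
/// Try the device's NUMA node first; fall back to any-node allocation if the
/// local node is exhausted. `None` as a node means "no preference".
pub fn numa_preferred_alloc<T>(
    device_node: Option<u32>,
    alloc_on_node: &dyn Fn(Option<u32>) -> Option<T>,
) -> Option<T> {
    if let Some(node) = device_node {
        if let Some(buf) = alloc_on_node(Some(node)) {
            return Some(buf); // NUMA-local allocation succeeded
        }
    }
    alloc_on_node(None) // fall back: any node is better than failing
}
```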
7.11.4 ACPI Enumerator
The ACPI enumerator is Tier 0 kernel-internal code that walks the ACPI namespace and creates platform device nodes in the registry. It handles the tables that define hardware topology:
| ACPI Table | Registry Impact |
|---|---|
| MCFG (PCI Express Memory Mapped Config) | Defines PCI segment groups and ECAM base addresses. The PCI enumerator uses these to access PCI config space. |
| SRAT (System Resource Affinity) | Maps PCI bus ranges and memory ranges to NUMA proximity domains. Sets numa_node on device nodes. |
| DMAR / IVRS (DMA Remapping) | Defines IOMMU hardware. Creates IOMMU group assignments (Section 7.3.8). Intel DMAR for VT-d, AMD IVRS for AMD-Vi. |
| DSDT / SSDT (Differentiated System Description) | Defines platform devices (embedded controllers, power buttons, battery, thermal zones). Each ACPI device object becomes a platform device node. |
| HPET / MADT | Timer and interrupt controller topology. Creates Tier 0 device nodes for APIC, I/O APIC, HPET. |
AML evaluation: The ACPI enumerator includes an AML (ACPI Machine Language)
interpreter for evaluating _STA (device status), _CRS (current resources), and
_HID (hardware ID) methods. This is a significant subsystem but is
required for correct hardware enumeration on any x86 system. The AML interpreter runs
in Tier 0 with full kernel privileges because it accesses hardware registers directly.
Device Tree enumerator (AArch64/RISC-V/PPC): Parses the flattened device tree (FDT)
passed by the bootloader. Each DT node with a compatible property becomes a platform
device node. The reg property populates DeviceResources.bars (as MMIO regions), and
the interrupts property populates DeviceResources.irqs. DT phandle references
become provider-client service links.
7.11.4a Firmware Quirk Framework
ACPI tables and Device Trees are authored by firmware engineers and are notoriously
buggy. Linux has accumulated thousands of firmware workarounds scattered across
subsystem-specific code (drivers/acpi/, arch/x86/kernel/, DMI match tables,
ACPI override tables). ISLE centralizes firmware workarounds into a structured quirk
framework, similar to the CPU errata framework (Section 17.4).
The problem is real — common firmware bugs observed in the wild:
- ACPI _CRS (Current Resources) reports incorrect MMIO ranges for PCI bridges,
causing resource conflicts
- SRAT (NUMA affinity) tables claim all memory belongs to NUMA node 0 on multi-socket
systems (broken BIOS update)
- DMAR (IOMMU) tables omit devices or report wrong scope, causing IOMMU group
misassignment
- Device Tree interrupt-map entries with wrong parent phandle references (ARM SoC
vendor bugs)
- DSDT/SSDT AML code with infinite loops, incorrect register addresses, or methods that
return wrong types
- MADT reports non-existent APIC IDs (causes boot failure if kernel trusts them)
- ECAM (PCI config space) base address wrong in MCFG table
ISLE's firmware quirk table:
/// Firmware quirk entry — matches a system to its required workarounds.
struct FirmwareQuirk {
/// System identification (DMI vendor + product + BIOS version).
match_id: DmiMatch,
/// ACPI table match (optional — match specific table revision).
table_match: Option<AcpiTableMatch>,
/// Human-readable quirk identifier.
quirk_id: &'static str,
/// Workaround: override, ignore, or patch firmware data.
action: QuirkAction,
}
enum QuirkAction {
/// Override a specific ACPI table with a corrected version (ACPI override).
OverrideTable { table_signature: [u8; 4], replacement: &'static [u8] },
/// Ignore a specific device entry in DMAR/IVRS (broken IOMMU scope).
IgnoreIommuDevice { segment: u16, bus: u8, device: u8, function: u8 },
/// Override NUMA affinity for a memory range (broken SRAT).
OverrideNumaAffinity { phys_start: u64, phys_end: u64, node: u32 },
/// Ignore an APIC ID in MADT (non-existent CPU).
IgnoreApicId { apic_id: u32 },
/// Patch a specific AML method (replace bytecode).
PatchAml { path: &'static str, replacement: &'static [u8] },
/// Skip enumeration for a device matching this HID (broken _CRS).
SkipDevice { hid: &'static str },
/// Custom workaround function.
Custom(fn() -> Result<()>),
}
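A hedged sketch of how such a quirk table might be consulted during boot. `DmiMatch` is simplified here to exact vendor/product strings (the structure above also matches BIOS version), and `applicable_quirks` is a hypothetical helper, not the real framework API.

```rust
/// Simplified DMI identity: exact vendor + product match.
pub struct DmiMatch { pub vendor: &'static str, pub product: &'static str }

/// Trimmed-down quirk entry (action payload omitted for brevity).
pub struct FirmwareQuirk {
    pub match_id: DmiMatch,
    pub quirk_id: &'static str,
}

/// Return every quirk whose DMI identity matches the running system.
pub fn applicable_quirks<'a>(
    table: &'a [FirmwareQuirk],
    dmi_vendor: &str,
    dmi_product: &str,
) -> Vec<&'a FirmwareQuirk> {
    table
        .iter()
        .filter(|q| q.match_id.vendor == dmi_vendor && q.match_id.product == dmi_product)
        .collect()
}

/// Example seed table — one entry per workaround, as described in the text.
pub static QUIRKS: &[FirmwareQuirk] = &[FirmwareQuirk {
    match_id: DmiMatch { vendor: "Dell Inc.", product: "PowerEdge R740" },
    quirk_id: "DELL-POWEREDGE-R740-BIOS-2.12",
}];
```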
Quirk database population — the initial quirk database is seeded from:
1. Linux's existing DMI quirk tables (drivers/acpi/, arch/x86/pci/) — these
document decades of firmware workarounds with specific DMI match strings
2. Community-reported firmware bugs (same mechanism as Linux's bugzilla)
3. Vendor-provided errata sheets (when available)
ACPI table override — Linux supports loading replacement ACPI tables from initramfs
(CONFIG_ACPI_TABLE_UPGRADE). ISLE supports the same mechanism: if a corrected DSDT
is placed in the initramfs at /lib/firmware/acpi/, it replaces the firmware-provided
table at boot. This allows users to fix firmware bugs without waiting for a BIOS update.
Boot-time quirk logging — all applied quirks are logged at boot:
isle: Firmware quirk applied: DELL-POWEREDGE-R740-BIOS-2.12 — DMAR ignore device 0000:00:14.0 (broken IOMMU scope)
isle: Firmware quirk applied: LENOVO-T14S-BIOS-1.38 — SRAT override node 0→1 for range 0x100000000-0x200000000
Why ISLE is more sensitive to firmware bugs than Linux — ISLE's topology-aware
device registry derives NUMA affinity, IOMMU groups, power management ordering, and
driver isolation domains from firmware-reported topology. A firmware bug that reports
wrong NUMA affinity causes ISLE to place a driver on the wrong NUMA node (performance
degradation). In Linux, the same bug might cause a suboptimal numactl suggestion but
doesn't affect driver placement (Linux doesn't have topology-aware driver isolation).
This means ISLE must invest more heavily in firmware workarounds than Linux for the
same set of hardware. The structured quirk framework makes this manageable — adding a
new workaround is a single table entry, not scattered if (dmi_match(...)) checks
across the codebase.
Defensive parsing — beyond per-system quirks, all firmware table parsers are defensively coded:
- ACPI table lengths are validated against the RSDP/XSDT-reported size
- The AML interpreter has an instruction count limit (prevents infinite loops in AML code)
- The Device Tree parser validates all phandle references before dereferencing
- PCI config space reads are bounds-checked against MCFG-reported ECAM regions
- Any parse failure is logged as an FMA event (Section 39) and the offending entry is
  skipped rather than causing a boot failure
7.11.5 Resource Assignment
During PCI enumeration, the registry assigns hardware resources to each device:
For each PCI device:
1. Read BAR registers to determine resource requirements (size, type).
2. Assign physical address ranges from the PCI memory/IO space allocator.
- MMIO BARs: allocate from PCI MMIO window (defined by ACPI `_CRS`
method on the PCI host bridge device; MCFG defines only the ECAM base
address for PCIe configuration space access).
- I/O BARs: allocate from PCI I/O window (legacy x86, rare).
3. Write assigned addresses back to BAR registers.
4. Populate DeviceResources.bars with the assigned mappings.
5. Allocate MSI/MSI-X vectors:
- If device supports MSI-X: allocate up to min(device_max, driver_requested) vectors.
- If MSI only: allocate power-of-2 vectors up to device limit.
- Fallback: assign legacy INTx pin.
6. Populate DeviceResources.irqs.
Resource conflicts (overlapping BAR assignments, IRQ vector exhaustion) are detected
during enumeration and logged as FMA events (Section 39). Conflicting devices remain
in Discovered state with no driver bound.
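The vector-count policy in step 5 can be expressed as two small helpers. These are illustrative (the names are not KABI), assuming MSI grants only power-of-two vector counts while MSI-X grants exact counts:

```rust
/// MSI-X: the device grants exactly min(device_max, driver_requested) vectors.
pub fn msix_vectors(device_max: u32, driver_requested: u32) -> u32 {
    device_max.min(driver_requested)
}

/// MSI: the hardware only supports power-of-two vector counts, so round the
/// request down to the largest power of two within the device limit (min 1).
pub fn msi_vectors(device_max: u32, driver_requested: u32) -> u32 {
    let limit = device_max.min(driver_requested).max(1);
    1 << (31 - limit.leading_zeros()) // highest set bit = floor power of two
}
```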
7.12 Sysfs Compatibility
The registry is the single source of truth for the /sys filesystem required by Linux
compatibility (Section 20.3).
7.12.1 Mapping
| Sysfs Path | Registry Source |
|---|---|
| /sys/devices/ | Device tree traversal (parent-child edges) |
| /sys/bus/pci/devices/ | All nodes with bus_type == Pci |
| /sys/bus/usb/devices/ | All nodes with bus_type == Usb |
| /sys/class/block/ | Nodes publishing "block" service |
| /sys/class/net/ | Nodes publishing "net" service |
| /sys/devices/.../driver | driver_binding.driver_name |
| /sys/devices/.../power/ | Power state and runtime PM policy |
| /sys/devices/.../uevent | Generated from node properties |
7.12.2 Attribute Files
Each standard property maps to the expected sysfs attribute format:
- vendor → property "vendor-id" formatted as 0x%04x
- device → property "device-id" formatted as 0x%04x
- class → property "class-code" formatted as 0x%06x
Custom driver-set properties appear under a properties/ subdirectory.
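The formatting rules above, sketched as hypothetical helpers (the function names are illustrative, not part of the compat layer's API):

```rust
/// vendor / device: 16-bit PCI IDs formatted as 0x%04x, matching Linux sysfs.
pub fn format_vendor(vendor_id: u16) -> String {
    format!("0x{:04x}", vendor_id)
}

/// class: 24-bit PCI class code formatted as 0x%06x.
pub fn format_class(class_code: u32) -> String {
    format!("0x{:06x}", class_code)
}
```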
7.12.3 Device Class via Service Names
Linux's /sys/class/ directories are derived from service publication:
- A driver that publishes a "net" service → device appears under /sys/class/net/
- A driver that publishes a "block" service → device appears under /sys/class/block/
- A driver that publishes an "input" service → device appears under /sys/class/input/
This is more principled than Linux's explicit class_create() calls because the
classification falls naturally out of what the driver actually does.
7.13 Concurrency and Performance
7.13.1 Locking Strategy
- Read path (hot): Property queries, service lookups, sysfs reads. Reader-writer lock allows concurrent reads.
- Write path (cold): Node creation, state transitions, driver binding, hotplug. Takes exclusive write lock.
- Per-node state: Atomic field for lock-free state checks ("is this device active?" does not need the tree lock).
- PM ordering cache: Computed once per PM transition. Invalidated when tree topology changes (hotplug).
7.13.2 Scalability
- Device enumeration: O(n*m) where n = match rules, m = unmatched devices. With <1000 drivers and <200 devices on a typical system, this completes in microseconds. Runs once at boot + on hotplug.
- Service lookup: Hash-indexed by service name. O(1) amortized.
- Property query: Binary search on sorted PropertyTable. O(log n), n < 30.
- PM ordering: Topological sort is O(V+E) where V = nodes, E = edges. Computed once, cached.
7.13.3 Memory Budget
| Component | Per Node | Notes |
|---|---|---|
| DeviceNode struct | ~512 bytes | Fixed-size fields |
| PropertyTable (avg 15 props) | ~1 KB | Key strings + values |
| Children/providers/clients | ~128 bytes | Vec overhead |
| Total per node | ~1.7 KB | Sum of the above |
A typical desktop with ~200 devices: ~340 KB. A busy server with ~1000 devices: ~1.7 MB. Well within kernel memory budget.
7.14 Open Questions
These are design questions to be resolved as implementation progresses:
1. USB topology depth. USB allows 7 levels of hubs. Should the registry represent the full hub topology or flatten it? Recommendation: full topology — it matches physical reality and is needed for correct PM ordering and surprise-removal cascading.
2. GPU sub-device modeling. Modern GPUs have display controllers, compute engines, video encoders as sub-functions. How should these be modeled? Recommendation: as child nodes of the GPU device node, each potentially bound to a different sub-driver.
3. Firmware enumerators. ACPI and Device Tree both provide device descriptions. The registry should have pluggable enumerator backends. The ACPI enumerator creates platform device nodes from the ACPI namespace; the DT enumerator creates nodes from the flattened device tree. These are Tier 0 kernel-internal code, not KABI drivers.
4. Multi-function PCI devices. A PCI device with multiple functions should have one node per function, all children of the same PCI slot/device node. The slot node is a child of the PCI bridge. (Matches Linux's PCI topology model.)
5. Service versioning. Should services published via registry_publish_service have
version negotiation? Recommendation: yes, using the same InterfaceVersion
mechanism already in the SDK (isle-driver-sdk/src/version.rs). The service vtable
includes a vtable_size and version field, same as all other KABI vtables.
6. Multi-provider services. What if multiple nodes provide the same service name
(e.g., multiple I2C controllers)? Recommendation: registry_lookup_service returns the
closest match by topology (prefer siblings, then parent subtree). A variant
registry_lookup_all_services could return all providers for cases like RAID member
enumeration.
7. Persistent device naming. Should the registry provide stable device names across
reboots (like Linux's /dev/disk/by-id/)? Recommendation: yes, derived from
bus identity + serial number properties. The registry generates stable symlink targets;
the compat layer creates the actual /dev/disk/by-* symlinks.
8. IOMMU group granularity for SR-IOV. SR-IOV virtual functions (VFs) can potentially
each be their own IOMMU group (if ACS is present on the PF's upstream port). Should the
registry auto-create VF device nodes when a PF driver enables SR-IOV? Recommendation:
yes — the PF driver calls a KABI method registry_create_vf_nodes(count), and the
registry creates child device nodes with their own BusIdentity::Pci entries, IOMMU
group assignments, and resource allocations.
9. AML interpreter scope. The ACPI AML interpreter is complex. Should it
support the full ACPI 6.5 specification, or a minimal subset sufficient for x86 server
and desktop hardware? Recommendation: minimal subset initially, but the "minimum"
for real hardware is larger than it appears. The bare minimum methods for boot on
production x86 hardware: _STA, _CRS, _HID, _UID, _BBN (base bus number),
_SEG (PCI segment), _PRT (PCI routing table), _OSI (OS identification — most
DSDTs gate behavior on this), _DSM (device-specific method — used by PCIe, NVMe,
USB controllers), _PS0/_PS3 (power state transitions — required for device
wake/sleep), _INI (device initialization), and _REG (operation region handler
registration). Without _OSI and _DSM, most x86 laptops and many servers fail to
enumerate devices correctly. Estimate: a production AML interpreter supporting these
methods plus the required AML bytecode opcodes is 15-25K SLOC. Add methods beyond this
set as hardware compatibility demands.
10. Resource reservation for hot-plug. PCI hot-plug bridges need pre-reserved address space for devices that may appear later. How much space to reserve? Recommendation: follow Linux's heuristic (256MB MMIO per hot-plug slot as default, configurable via kernel parameter). The registry's PCI allocator should leave gaps in the address map for slots marked as hot-plug capable in ACPI.
11. KABI long-term evolution. The append-only vtable design (Section 5.3) guarantees forward compatibility but raises the question of long-term bloat. After many versions, vtables accumulate dead methods unused by any active driver. Section 5.3a defines a 5-major-release support window with deprecation and removal phases. Open question: should the support window be shorter (3 releases, ~3 years) for faster cleanup, or longer (7 releases, ~7 years) for maximum driver compatibility? Recommendation: 5 releases as the default, with the option to extend individual KABI versions via an "LTS KABI" designation for enterprise-critical driver ecosystems (e.g., a storage vendor's driver certified for a 7-year enterprise distro lifecycle gets a KABI LTS extension for that version).
12. IOMMU nested translation performance. Two-level IOMMU translation (Section 7.3.8) is known to cause significant TLB pressure under SR-IOV/VFIO workloads. Should ISLE implement IOMMU large page promotion (consolidating 4KB mappings into 2MB IOMMU pages) proactively, or only when performance monitoring detects IOTLB thrashing? Recommendation: proactive — always prefer the largest IOMMU page size that fits the DMA mapping. The memory overhead of tracking IOMMU large pages is minimal compared to the IOTLB miss penalty.
7.15 Firmware Management
Devices need firmware updates. The kernel provides infrastructure for loading and updating device firmware without requiring device-specific userspace tools.
7.15.1 Firmware Loading
Firmware loading flow (boot and runtime):
1. Driver calls kabi_request_firmware(name, device_id).
2. Kernel searches firmware paths in order:
a. /lib/firmware/updates/<name> (admin overrides)
b. /lib/firmware/<name> (distro-provided)
c. Initramfs embedded firmware (for boot-critical devices)
3. If found: kernel maps the firmware blob read-only into the
driver's isolation domain. Driver receives a FirmwareBlob handle
with .data() and .size() accessors.
4. Driver loads firmware to device via its own mechanism
(MMIO, DMA upload, vendor mailbox).
5. Driver releases the handle; kernel unmaps the blob.
Same semantics as Linux request_firmware() / request_firmware_nowait().
The async variant (kabi_request_firmware_async) does not block the
driver's probe path — useful for large firmware blobs (>10MB).
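An illustrative driver-side use of this flow. `kabi_request_firmware` and `FirmwareBlob` follow the names in the text, but the bodies here are simulation stubs — the search-path logic, read-only mapping, and device upload are only hinted at in comments.

```rust
/// Handle to a read-only firmware blob mapped into the driver's domain.
pub struct FirmwareBlob { bytes: Vec<u8> }

impl FirmwareBlob {
    pub fn data(&self) -> &[u8] { &self.bytes }
    pub fn size(&self) -> usize { self.bytes.len() }
}

/// Stub: a real implementation searches /lib/firmware/updates/<name>,
/// /lib/firmware/<name>, then initramfs-embedded firmware, and maps the
/// blob read-only into the caller's isolation domain.
pub fn kabi_request_firmware(name: &str) -> Result<FirmwareBlob, &'static str> {
    match name {
        "nvme/example-fw.bin" => Ok(FirmwareBlob { bytes: vec![0xAA; 4096] }),
        _ => Err("firmware not found"),
    }
}

/// Typical probe-path usage: request, upload, release.
pub fn probe_load_firmware() -> Result<usize, &'static str> {
    let fw = kabi_request_firmware("nvme/example-fw.bin")?;
    // The driver would now upload fw.data() via MMIO/DMA/vendor mailbox.
    let uploaded = fw.size();
    Ok(uploaded) // handle dropped here; the kernel would unmap the blob
}
```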
7.15.2 Firmware Update (Runtime)
Runtime firmware update (fwupd / vendor tools):
1. Userspace writes firmware capsule to /sys/class/firmware/<device>/loading.
2. Kernel validates:
a. Signature (mandatory: Ed25519 or PQC if enabled).
The signing key must match the device's firmware trust anchor
(embedded in device or provided by vendor via UEFI db).
b. Version (must be >= current version, prevents downgrade attacks
unless admin explicitly overrides via firmware.allow_downgrade=1).
3. Kernel notifies driver via KABI callback:
update_firmware(blob, blob_size) -> FirmwareUpdateResult.
4. Driver performs the device-specific update procedure:
- NVMe: Firmware Download + Firmware Commit (NVMe admin commands).
- GPU: vendor-specific update mechanism.
- NIC: flash update via vendor mailbox.
5. Driver returns result: Success, NeedsReset, Failed(error_code).
6. If NeedsReset: kernel marks device for reset. Reset can be
triggered immediately (if no active I/O) or deferred to next
maintenance window (admin-configurable).
UEFI capsule updates (system firmware):
Kernel writes capsule to EFI System Resource Table (ESRT) via
efi_capsule_update(). Actual update happens on next reboot.
Same mechanism as Linux (CONFIG_EFI_CAPSULE_LOADER).
Exposes /dev/efi_capsule_loader for userspace tools (fwupd).
7.15.3 Linux Compatibility
/sys/class/firmware/<device>/loading — firmware loading trigger
/sys/class/firmware/<device>/data — firmware blob upload
/sys/class/firmware/<device>/status — update status
/sys/bus/*/devices/*/firmware_node/ — ACPI firmware node link
/dev/efi_capsule_loader — UEFI capsule interface
fwupd works unmodified — it uses the standard sysfs firmware update interface and UEFI capsule loader, both of which are provided.
Appendix: Comparison with Prior Art
| Aspect | Linux | IOKit | Windows PnP | Fuchsia DF | ISLE |
|---|---|---|---|---|---|
| Tree owner | Kernel (kobject) | Kernel (IORegistry) | Kernel (devnode) | Userspace (devmgr) | Kernel (DeviceRegistry) |
| Matching | Per-bus (module_alias) | Property dict match | INF file rules | Bind rules | MatchRule in ELF .kabi_match |
| PM ordering | Heuristic (dpm_list) | IOPMPowerState tree | IRP tree walk | Component PM | Topological sort of device tree |
| Service discovery | Per-subsystem APIs | IOService matching | WDF target objects | Protocol/service | Unified registry_publish/lookup |
| Hot-plug | Per-bus callbacks | IOService terminate | PnP IRP dispatch | devmgr events | Registry-mediated events |
| Crash recovery | Kernel panic | IOService terminate | Bugcheck | Component restart | Registry-orchestrated reload |
| ABI coupling | Tight (kobject in driver) | Tight (C++ inheritance) | Tight (WDM/WDF) | Protocol-only | None (KABI vtable only) |
| Isolation | None | None | None | Process boundary | Domain isolation + process + capability |
8. Zero-Copy I/O Path
The entire I/O path from user space to device and back avoids all data copies. This is essential for matching Linux performance.
NVMe Read Example (io_uring SQPOLL + Registered Buffers)
Step 1: User writes SQE to io_uring submission ring
[User space, shared memory, 0 transitions]
Step 2: SQPOLL kernel thread reads SQE from ring
[ISLE Core, shared memory read, 0 copies]
Step 3: Domain switch to NVMe driver domain (~23 cycles on x86 MPK)
[Single WRPKRU on x86; MSR POR_EL0+ISB on AArch64 POE; MCR DACR on ARMv7]
Step 4: NVMe driver writes command to hardware submission queue
[Pre-computed DMA address from registered buffer]
Step 5: Domain switch back to ISLE Core (~23 cycles on x86 MPK)
[Submit path complete, return to core domain]
Step 6: NVMe device DMAs data directly to user buffer
[IOMMU-validated, zero-copy, device -> user memory]
Step 7: NVMe device writes completion to hardware CQ, raises interrupt
Step 8: Interrupt routes to NVMe driver (domain switch, ~23 cycles on x86 MPK)
Driver reads hardware CQE
Step 9: Domain switch back to ISLE Core (~23 cycles on x86 MPK)
Step 10: ISLE Core writes CQE to io_uring completion ring
[Shared memory write, 0 copies]
Step 11: User reads CQE from completion ring
[User space, shared memory, 0 transitions]
Summary:
- Total data copies: 0
- Total domain switches: 4 (steps 3+5 on submit path, steps 8+9 on completion path)
- Total domain switch overhead: ~92 cycles on x86 MPK (4 x ~23 cycles per Section 3;
  see Section 3 table for other architectures)
- Device latency: ~3-10 us
- Overhead percentage: < 1%
TCP Receive Path
Step 1: NIC DMAs packet to pre-posted receive buffer
[IOMMU-validated, zero-copy]
Step 2: NIC raises interrupt -> domain switch to NIC driver (~23 cycles on x86 MPK)
Step 3: NIC driver processes descriptor, identifies packet
Domain switch back to ISLE Core (~23 cycles on x86 MPK)
Step 4: ISLE Core dispatches to isle-net -> domain switch to isle-net (~23 cycles on x86 MPK)
Step 5: isle-net processes TCP headers, copies payload to socket buffer
(This is the one "copy" -- same as Linux. Technically a move
of ownership, not a memcpy, when using page-flipping.)
Step 6: Domain switch back to ISLE Core (~23 cycles on x86 MPK)
ISLE Core signals epoll/io_uring waiters
Step 7: User reads from socket via read()/recvmsg()/io_uring
Data delivered from socket buffer (zero-copy with MSG_ZEROCOPY)
- Total domain switches: 4 (2 domain entries x 2 switches each: enter NIC driver + exit,
  enter isle-net + exit)
- Total domain switch overhead: ~92 cycles on x86 MPK (~20ns) on a ~5 us path = < 1%
  (see Section 3 for other architectures)
8a. IPC Architecture and Message Passing
Section 8 describes the data plane -- how bytes flow from user space through Tier 1 drivers to devices and back with zero copies. This section describes the control plane that Section 8's data plane relies on: the IPC primitives that carry commands, completions, capability transfers, and event notifications between isolation domains.
8a.1 IPC Primitives
ISLE's IPC model has four distinct layers, each serving a different boundary:
1. Intra-kernel IPC (between isolation domains within Ring 0): domain ring buffers. Shared memory regions with per-domain access controlled by the isolation domain register (WRPKRU on x86, POR_EL0 on AArch64, DACR on ARMv7, etc.). Zero-copy, zero-syscall. This is the transport for all isle-core to Tier 1 driver communication — the command/completion flow shown in Section 8's NVMe and TCP examples. The "domain switch" at each step in those diagrams crosses a domain ring buffer boundary.
2. Kernel-user IPC (between kernel and user space): io_uring submission/completion rings. Standard Linux ABI (Section 20.5). Applications submit SQEs to the io_uring submission ring and receive CQEs from the completion ring. This is the only I/O interface that user space sees. ISLE's io_uring implementation is fully compatible with Linux 6.x semantics -- unmodified applications work without changes.
3. Inter-process IPC (between user processes): POSIX IPC.
Pipes, Unix domain sockets, POSIX message queues, and POSIX shared memory -- implemented
via the syscall interface (Section 20). These are not performance-critical kernel
paths; they exist for application compatibility. System V IPC (shmget, msgget, semget)
is supported but deprecated in favor of POSIX equivalents.
4. Hardware peer IPC (between the host kernel and a device running ISLE firmware): domain ring buffers over PCIe P2P.
A device that participates as a first-class cluster member (Section 47.2.2) communicates
with the host kernel via the same domain ring buffer protocol used for intra-kernel IPC
(Layer 1), transported over PCIe peer-to-peer MMIO and MSI-X interrupts instead of
in-process memory. From the host kernel's perspective, the device firmware endpoint is
just another ring buffer pair — the same DomainRingBuffer structure, the same
ClusterMessageHeader wire format, the same message-passing discipline. The transport
medium changes (PCIe instead of cache-coherent RAM); the abstraction does not.
This is not a compatibility shim. It is the intended model for first-class hardware
participation: a SmartNIC, DPU, computational storage device, or RISC-V accelerator
running ISLE presents an IPC endpoint identical in structure to an in-kernel Tier 1
driver, while owning its own scheduler, memory manager, and capability space.
See Section 47.2.2 for the wire protocol, implementation paths (A/B/C), and near-term
hardware targets.
The terms are not interchangeable. When this document says "io_uring", it means the userspace-facing async I/O interface. When it says "domain ring buffer", it means the internal kernel transport between isolation domains. An io_uring SQE from userspace triggers an isolation domain switch to a Tier 1 driver via a domain ring buffer — the two mechanisms are connected but architecturally distinct.
User space Kernel (Ring 0)
+-----------+ +------------------------------------------+
| App | | isle-core Tier 1 driver |
| | io_uring SQE | |
| SQ ring -|-------------------->|-> dispatch -----> domain cmd ring --------->|
| | | (WRPKRU) |
| | io_uring CQE | |
| CQ ring <|--------------------<|<- collect <----- domain cpl ring <---------|
| | | (WRPKRU) |
+-----------+ +------------------------------------------+
Layer 2 Layer 1 (internal)
(Linux ABI) (domain ring buffers)
8a.2 Domain Ring Buffer Design
Each Tier 1 driver has a pair of ring buffers shared with isle-core: a command ring (isle-core produces, driver consumes) and a completion ring (driver produces, isle-core consumes). Both use the same underlying structure:
/// A lock-free single-producer single-consumer ring buffer that lives in
/// a shared memory region accessible to exactly two isolation domains.
///
/// The header occupies two cache lines (one producer-owned, one
/// consumer-owned). Ring data follows immediately after the header,
/// aligned to `entry_size`.
#[repr(C, align(64))]
pub struct DomainRingBuffer {
/// Write claim position. Producers CAS this to claim slots (MPSC mode).
/// In SPSC mode, only the single producer increments this.
pub head: AtomicU32,
/// Published position. In MPSC mode, a producer increments this (in order)
/// AFTER writing data to the claimed slot. The consumer reads `published`
/// (not `head`) to determine how many entries are ready. In SPSC mode,
/// `published` always equals `head` (the single producer updates both).
/// In broadcast mode, this field is NOT the source of truth —
/// `last_enqueued_seq` (u64) is the authoritative write position. The
/// `published` field is derived (`write_seq / 2`) for diagnostic
/// compatibility only. Implementations MUST NOT increment `published`
/// independently in broadcast mode.
pub published: AtomicU32,
/// Number of entries. Must be a power of two.
pub size: u32,
/// Bytes per entry. Fixed at ring creation time.
pub entry_size: u32,
/// Number of entries dropped due to ring-full condition.
/// Monotonically increasing. Exposed via islefs diagnostics (§41).
pub dropped_count: AtomicU64,
/// Sequence number of the last successfully enqueued entry.
/// Consumers use this to detect gaps: if the consumer's last-seen
/// sequence is less than `last_enqueued_seq - ring_size`, entries
/// were lost.
/// In broadcast mode, this field serves as `write_seq` for torn-read
/// prevention (incremented by 2 per entry; odd = write-in-progress,
/// even = stable). See "Broadcast channels" below.
pub last_enqueued_seq: AtomicU64,
/// Padding to fill the producer cache line to exactly 64 bytes.
/// Layout: head(4) + published(4) + size(4) + entry_size(4)
/// + dropped_count(8) + last_enqueued_seq(8) + _pad(32) = 64.
_pad_producer: [u8; 32],
/// Read position. Only the consumer increments this.
/// On a separate cache line from head/published to avoid false sharing.
pub tail: AtomicU32,
/// Padding to fill the consumer cache line.
_pad_consumer: [u8; 60],
// Ring data follows: `size * entry_size` bytes.
}
Note on false sharing: size and entry_size are read-only after initialization and
are read by both producer and consumer. They are placed on the producer's cache line for
layout simplicity, but implementations SHOULD duplicate these values on the consumer's
cache line (as consumer_size and consumer_entry_size) to avoid false sharing. The
consumer reads only from its own cache line.
Lock-free SPSC protocol. The producer writes an entry at data[head % size], then
increments head and published together (in SPSC mode they are always equal). The
consumer reads the entry at data[tail % size] when published > tail, then increments
tail. No locks, no CAS, no contention. The head/published fields are on one cache
line (producer-owned); tail is on a separate cache line (consumer-owned). This
eliminates false sharing on hot paths.
Memory ordering. The producer uses Release ordering on the published store. The
consumer uses Acquire ordering on the published load. This pair ensures that the
entry data written by the producer is visible to the consumer before the consumer sees
the updated published counter.
On x86-64 this compiles to plain MOV instructions (TSO provides the required ordering
for free). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers
(stlr/ldar on ARM, fence-qualified atomics on RISC-V, lwsync/isync on PPC).
| Architecture | Producer (Release store) | Consumer (Acquire load) | Notes |
|---|---|---|---|
| x86-64 | MOV (TSO) | MOV (TSO) | No explicit barriers needed |
| AArch64 | STLR | LDAR | ARM's acquire/release instructions |
| RISC-V 64 | amoswap.w.rl or fence rw,w + sw | lw + fence r,rw | RVWMO requires explicit fencing |
| PPC32 | lwsync + stw | lwz + isync | Weak ordering; lwsync = lightweight sync |
| PPC64LE | lwsync + std | ld + isync | Same model as PPC32; lwsync preferred over sync |
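The SPSC protocol and its Release/Acquire pairing can be sketched as a simplified userspace ring. This is an illustration only: entries are fixed at u64, and the cache-line padding, `dropped_count`, and sequence fields of the real `DomainRingBuffer` are omitted.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

pub struct SpscRing {
    head: AtomicU32,      // producer-owned claim position
    published: AtomicU32, // producer publishes with Release; == head in SPSC
    tail: AtomicU32,      // consumer-owned read position
    size: u32,            // number of entries; power of two
    data: Vec<AtomicU64>, // ring data (u64 entries for simplicity)
}

impl SpscRing {
    pub fn new(size: u32) -> Self {
        assert!(size.is_power_of_two());
        Self {
            head: AtomicU32::new(0),
            published: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            size,
            data: (0..size).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Producer: write the entry, then publish with Release so the consumer's
    /// Acquire load on `published` observes the data before the counter.
    pub fn produce(&self, value: u64) -> Result<(), ()> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head.wrapping_sub(tail) == self.size {
            return Err(()); // ring full — caller applies backpressure
        }
        self.data[(head & (self.size - 1)) as usize].store(value, Ordering::Relaxed);
        self.head.store(head.wrapping_add(1), Ordering::Relaxed);
        self.published.store(head.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer: Acquire on `published` pairs with the producer's Release.
    pub fn consume(&self) -> Option<u64> {
        let tail = self.tail.load(Ordering::Relaxed);
        let published = self.published.load(Ordering::Acquire);
        if published == tail {
            return None; // ring empty
        }
        let value = self.data[(tail & (self.size - 1)) as usize].load(Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Some(value)
    }
}
```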
Backpressure. When the ring is full (head - tail == size), the producer cannot write.
For SPSC rings (command and completion channels), isle-core handles this in two stages:
(1) spin for up to 64 iterations checking whether the consumer has advanced tail -- this
covers the common case where the driver is actively draining; (2) if the ring is still full
after spinning, yield to the scheduler via sched_yield_current() and retry on the next
scheduling quantum. This avoids wasting CPU on a stalled driver while keeping the fast path
lock-free. For MPSC rings (event channels), backpressure behavior depends on the calling
context — see the MPSC producer API contract in Section 8a.3 for the distinction between
blocking (mpsc_produce_blocking(), thread context only) and non-blocking
(mpsc_try_produce(), safe in any context) variants.
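The two-stage SPSC backpressure policy can be sketched as follows, with `std::thread::yield_now()` standing in for the kernel's `sched_yield_current()`. The closures `ring_full` and `write_entry` are hypothetical hooks, not ISLE APIs:

```rust
/// Two-stage backpressure for a full SPSC ring: bounded spin, then yield.
fn produce_with_backpressure(
    mut ring_full: impl FnMut() -> bool,
    mut write_entry: impl FnMut(),
) {
    loop {
        // Stage 1: spin up to 64 iterations -- covers the common case
        // where the consumer is actively draining the ring.
        for _ in 0..64 {
            if !ring_full() {
                write_entry();
                return;
            }
            std::hint::spin_loop();
        }
        // Stage 2: the consumer is stalled; give up the CPU and retry on
        // the next scheduling quantum instead of burning cycles.
        std::thread::yield_now();
    }
}
```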
8a.3 Channel Types and Capability Passing
The ring buffer primitive from Section 8a.2 is instantiated in four channel configurations:
Command channels (SPSC): isle-core -> driver. One per driver instance. Carries I/O requests (read, write, discard), configuration commands (set queue depth, enable feature), and health queries (heartbeat, statistics request). Isle-core is the sole producer; the driver is the sole consumer.
Completion channels (SPSC): driver -> isle-core. One per driver instance. Carries I/O completions (success, error, partial), interrupt notifications (forwarded from the hardware interrupt handler), and error reports (device errors, internal driver faults). The driver is the sole producer; isle-core is the sole consumer.
Event channels (MPSC): multiple drivers -> isle-core event loop. Used for asynchronous
events that do not belong to a specific I/O flow: device hotplug notifications, link state
changes (NIC up/down), thermal throttle alerts, error notifications requiring global
coordination. Multiple drivers may need to signal the same event loop, so the MPSC variant
uses a compare-and-swap on head to coordinate multiple producers.
MPSC scaling limits: For event channels with >10 concurrent producers (unusual but possible in systems with many independent drivers signaling a single event loop), CAS contention on the ring head can degrade performance. In this regime, hierarchical fanout is recommended: drivers signal per-device intermediate rings, and an aggregator thread (or softirq batch) forwards events to the central ring. This reduces contention from O(producers) to O(1) at the cost of one additional indirection. The default single-ring design is optimized for the common case of 2-5 active producers per channel.
impl DomainRingBuffer {
/// MPSC non-blocking produce: multiple producers coordinate via CAS on head.
/// Returns Err(RingFull) immediately if the ring is full.
/// Safe to call from any context (thread, IRQ, softirq).
/// See "MPSC producer API contract" below for the blocking variant.
///
/// Two-phase commit protocol:
/// Phase 1 (claim): CAS on `head` to reserve a slot. After CAS success,
/// the slot is exclusively ours but NOT yet visible to the consumer.
/// Phase 2 (publish): After writing data, wait until `published` catches
/// up to our slot (ensuring in-order publication), then advance `published`.
///
/// The consumer reads `published` (not `head`) to determine ready entries.
/// This eliminates the data race where a consumer sees an incremented `head`
/// but reads a slot whose data has not yet been written.
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingFull> {
// --- BEGIN interrupt-disabled section ---
// Disable interrupts BEFORE the Phase 1 CAS to prevent a deadlock:
// if an interrupt fires between a successful CAS (slot claimed) and
// Phase 2 (published advanced), an interrupt handler calling
// mpsc_try_produce on the same ring would spin forever in Phase 2
// waiting for the interrupted thread's slot to be published. Moving
// local_irq_save() here eliminates that race window entirely.
// The CAS loop is bounded (succeeds or returns RingFull), so the
// additional interrupt-disabled time is minimal.
let irq_state = arch::current::interrupts::local_irq_save();
// Phase 1: Claim a slot by advancing head (interrupts already disabled).
let my_slot;
loop {
let current_head = self.head.load(Ordering::Relaxed);
let current_tail = self.tail.load(Ordering::Acquire);
// Ring full?
if current_head.wrapping_sub(current_tail) >= self.size {
arch::current::interrupts::local_irq_restore(irq_state);
return Err(RingFull);
}
// Attempt to claim the slot.
if self
.head
.compare_exchange_weak(
current_head,
current_head.wrapping_add(1),
Ordering::AcqRel,
Ordering::Relaxed,
)
.is_ok()
{
my_slot = current_head;
break;
}
core::hint::spin_loop();
}
// Write entry data to the claimed slot.
let offset = (my_slot % self.size) as usize * self.entry_size as usize;
// SAFETY: offset is within bounds (power-of-two size, fixed entry_size).
// The slot is exclusively ours because we won the CAS race.
unsafe {
core::ptr::copy_nonoverlapping(
entry.as_ptr(),
self.data_ptr().add(offset),
self.entry_size as usize,
);
}
// Phase 2: Publish. Wait until all prior slots are published, then
// advance `published` to make our slot visible to the consumer.
// This spin is brief: it only waits for producers that claimed earlier
// slots to finish their writes. Under normal operation, this completes
// in 1-2 iterations.
//
        // Drain deferred publications from previous calls. Before attempting
        // our own Phase 2, drain ALL entries from the per-CPU deferred publish
        // ring buffer. This ensures that deferrals from prior mpsc_try_produce()
        // calls are re-attempted (and completed) before new entries are
        // published, preventing silent loss if multiple producers defer in
        // succession.
//
        // The drain takes no arguments — each deferred entry stores a pointer to
        // the ring's `published` counter alongside the slot index, so the drain
        // correctly targets the ring that each slot belongs to (a producer may
        // have deferred on ring A and now be calling mpsc_try_produce() on ring B).
arch::current::cpu::deferred_publish_drain();
// Bounded publish wait: To prevent unbounded interrupt-disabled spinning,
// Phase 2 uses a bounded spin of 64 iterations. If `self.published` has
// not advanced to `my_slot` within 64 iterations, the producer stores the
// ring's `published` pointer and its slot index as a pair into a per-CPU
// deferred publish ring buffer, then re-enables interrupts. The drain path
        // (at the start of the next mpsc_try_produce() call and on the consumer side)
// re-attempts publication on behalf of the stalled producer, using the
// stored ring pointer to target the correct ring. The per-CPU deferred
// buffer is a ring (`[Option<(*const AtomicU32, u32)>; 16]` with
// head/tail indices) rather than a single `Option<u32>`, so multiple
// consecutive deferrals (potentially targeting different rings) can queue
// without silently losing earlier deferred values. The buffer holds 16
// entries (increased from an earlier 4-entry design to ensure bounded-time
// behavior under heavy contention). If the deferred buffer itself is full
// (16 outstanding deferrals — an extreme edge case indicating severe system
// overload), the producer re-enables interrupts before falling back to a
// bounded spin (up to 256 iterations with `core::hint::spin_loop()`). If the
        // bounded spin also fails, the producer returns `Err(RingFull)` to the
// caller, which applies backpressure (increment `dropped_count` for IRQ
// producers, or yield and retry for thread-context producers). This ensures
// the interrupt-disabled window is always bounded. The common-case bound is:
// Phase 1 CAS (~5ns, usually 1 attempt) + data write +
// drain (up to 16 entries * CAS each = ~80ns) + Phase 2 spin
// (up to 64 * ~5ns = ~320ns) = ~410ns in the common case.
let mut spin_count = 0u32;
loop {
if self
.published
.compare_exchange_weak(
my_slot,
my_slot.wrapping_add(1),
Ordering::Release,
Ordering::Relaxed,
)
.is_ok()
{
break;
}
spin_count += 1;
if spin_count >= 64 {
// Exceeded bounded spin — defer completion to the consumer drain
// path and re-enable interrupts to avoid unbounded IRQ latency.
// The deferred buffer holds up to 16 entries; if it is full,
// re-enable IRQs and fall through to bounded spin (system overloaded).
// Fence ensures entry data written at the slot is visible to
// all CPUs before the slot can be published by a deferred drain
// on any CPU. Without this, on weakly-ordered architectures
// (AArch64, RISC-V, PPC), a different CPU draining and publishing
// via CAS(Release) would only order its own stores, not the
// original writer's stores.
core::sync::atomic::fence(Ordering::Release);
if arch::current::cpu::deferred_publish_enqueue(&self.published, my_slot) {
arch::current::interrupts::local_irq_restore(irq_state);
return Ok(());
}
// Deferred buffer full — re-enable IRQs to preserve RT
// guarantees, then bounded spin outside the IRQ-disabled window.
arch::current::interrupts::local_irq_restore(irq_state);
let mut fallback_spin = 0u32;
loop {
if self.published.compare_exchange_weak(
my_slot, my_slot.wrapping_add(1),
Ordering::Release, Ordering::Relaxed,
).is_ok() {
return Ok(());
}
fallback_spin += 1;
if fallback_spin >= 256 {
                        // System severely overloaded -- return RingFull so the
                        // caller applies backpressure.
                        return Err(RingFull);
}
core::hint::spin_loop();
}
}
core::hint::spin_loop();
}
// --- END interrupt-disabled section ---
arch::current::interrupts::local_irq_restore(irq_state);
Ok(())
}
}
To prevent data loss when no future send() occurs, the per-CPU idle entry hook
(cpu_idle_enter(), Section 16) drains the deferred publish buffer for all MPSC rings
registered on that CPU. Additionally, when a thread that performed a deferred publish is
migrated to a different CPU, the migration path drains the source CPU's deferred buffer.
These hooks ensure deferred entries are published within a bounded window (at most one
scheduler tick, ~4ms).
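The deferred-publish mechanism can be sketched as a small single-threaded model. `DeferredPublish` and its methods are illustrative stand-ins for the per-CPU `deferred_publish_enqueue()` / `deferred_publish_drain()` described above; the real buffer is per-CPU state and the real drain bounds its retry:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sketch of the per-CPU deferred-publish ring: 16 entries, each pairing a
/// pointer to some ring's `published` counter with the slot awaiting
/// publication (so deferrals on different rings coexist).
struct DeferredPublish {
    slots: [Option<(*const AtomicU32, u32)>; 16],
    head: usize,
    tail: usize,
}

impl DeferredPublish {
    fn new() -> Self {
        Self { slots: [None; 16], head: 0, tail: 0 }
    }

    /// Returns false when the 16-entry buffer is full; the caller then
    /// falls back to the bounded spin described above.
    fn enqueue(&mut self, published: &AtomicU32, slot: u32) -> bool {
        if self.head - self.tail >= 16 {
            return false;
        }
        self.slots[self.head % 16] = Some((published as *const _, slot));
        self.head += 1;
        true
    }

    /// Re-attempt publication for every deferred entry. An entry is retired
    /// only once its CAS `slot -> slot + 1` succeeds, preserving in-order
    /// publication per ring. (The kernel version bounds this retry; the
    /// sketch spins for brevity.)
    fn drain(&mut self) {
        while self.tail < self.head {
            let (ptr, slot) = self.slots[self.tail % 16].take().unwrap();
            // SAFETY (sketch): the pointed-to ring outlives the deferral.
            let published = unsafe { &*ptr };
            while published
                .compare_exchange(slot, slot.wrapping_add(1),
                                  Ordering::Release, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
            self.tail += 1;
        }
    }
}
```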
MPSC Phase 2 preemption hazard and mitigation. The Phase 2 publish spin in
mpsc_try_produce() can stall if a producer is preempted (by an interrupt or scheduler)
between Phase 1 (CAS on head) and Phase 2 (advancing published). While preempted,
the published counter is stuck at the preempted producer's slot, blocking all
subsequent producers from making their entries visible to the consumer -- even though
their data is already written. This is not a deadlock (the preempted producer will
eventually resume and complete Phase 2), but it can cause unbounded latency spikes on
the consumer side.
Mitigation: ISLE addresses this in three ways:
1. Interrupts disabled from before Phase 1 through Phase 2. The MPSC produce path disables interrupts (not just preemption) BEFORE the Phase 1 CAS, keeping them disabled through the Phase 2 published counter advancement. This prevents the following deadlock scenario: on a uniprocessor (or any CPU), thread T1 claims slot N via CAS, then an IRQ fires and the IRQ handler claims slot N+1 via CAS. The IRQ handler's Phase 2 spin waits for published to reach N, but T1 cannot advance published because it is interrupted -- deadlock. Disabling interrupts before Phase 1 eliminates this window entirely (there is no gap between CAS success and interrupt disabling). The interrupt-disabled region covers: Phase 1 CAS (bounded -- succeeds or returns RingFull, typically 1 attempt = ~5ns), data write, deferred drain (up to 16 entries = ~80ns), and Phase 2 publish CAS (up to 64 iterations = ~320ns), totaling ~410ns in the common case. On multiprocessor systems, disabling preemption alone would suffice (another CPU could run the interrupted producer), but disabling interrupts is correct on all configurations and the cost is negligible.
2. Consumer-side timeout scan (defense in depth). The consumer (isle-core event loop) maintains a watchdog: if head > published for more than 1000 iterations of the poll loop (~10μs), the consumer treats the gap as a stalled producer and logs a diagnostic event. In the pathological case where a producer is stalled indefinitely (e.g., a Tier 1 driver that faulted between Phase 1 and Phase 2), the crash recovery path (Section 9) resets the ring buffer entirely -- the stale head claim is discarded when the driver's isolation domain is revoked.
3. Interrupt handlers use bounded produce. Interrupt handlers that produce to MPSC rings use mpsc_try_produce(), which fails with Err(RingFull) if the ring is full rather than spinning. This prevents interrupt handlers from spinning on a full ring while the consumer (which runs in thread context) cannot drain it -- if the consumer needs to be scheduled to make progress, a spinning IRQ handler creates an unbounded spin or deadlock.
MPSC producer API contract. The MPSC ring exposes two producer entry points with distinct calling context requirements:
impl DomainRingBuffer {
/// Non-blocking produce. Returns immediately if the ring is full.
/// Safe to call from ANY context (thread, IRQ, softirq).
///
/// On success: entry is enqueued and will be visible to the consumer
/// after Phase 2 publish completes.
/// On Err(RingFull): the ring has no free slots. The caller is responsible
/// for handling the overflow (see overflow accounting below).
pub fn mpsc_try_produce(&self, entry: &[u8]) -> Result<(), RingFull>;
/// Blocking produce. Spins (with bounded spin + yield) until a slot
/// becomes available, then enqueues the entry.
///
/// MUST NOT be called with interrupts disabled. If the ring is full,
/// this function spins waiting for the consumer to drain entries. If
/// interrupts are disabled, the consumer (which runs in thread context)
/// may never be scheduled, causing an unbounded spin.
///
/// In debug builds: panics immediately if called with interrupts
/// disabled (detected via arch::current::interrupts::are_enabled()).
/// In release builds: falls back to mpsc_try_produce() with overflow
/// accounting if interrupts are disabled (defense in depth — the debug
/// panic should catch all such call sites during development).
pub fn mpsc_produce_blocking(&self, entry: &[u8]);
}
Calling mpsc_produce_blocking() with interrupts disabled is a BUG. Debug builds
panic to catch the error during development; release builds fall back to
mpsc_try_produce() with overflow accounting to avoid a hard hang in production. The
release fallback is a safety net, not a license to call the blocking variant from IRQ
context — all such call sites must be fixed.
Overflow accounting. When mpsc_try_produce() returns Err(RingFull) (whether called
directly from IRQ context or as the release-build fallback), the caller increments a
per-ring atomic overflow counter. The overflow statistics are stored directly in the
DomainRingBuffer producer cache line as dropped_count and last_enqueued_seq
(see struct definition in Section 8a.2). Inlining these fields into the ring header
avoids an extra pointer dereference on the drop path and keeps both fields on the same
cache line as head and published, which are already hot during produce operations.
Each MPSC entry includes a monotonic sequence number in its header. The consumer detects dropped entries by checking for gaps in the sequence: if the sequence jumps from N to N+K (where K > 1), then K-1 entries were dropped due to overflow. The consumer logs a diagnostic event on gap detection, including the ring identity and gap size, so operators can identify rings that need larger depth configuration (Section 8a.4 channel depths).
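The gap check reduces to simple sequence arithmetic. The `GapDetector` below is an illustrative consumer-side helper (the name and the assumption of monotonically increasing sequence numbers are mine, matching the header scheme described above):

```rust
/// Consumer-side detector for dropped MPSC entries: a sequence jump from
/// N to N+K means K-1 entries were dropped due to overflow.
struct GapDetector {
    expected_seq: u64, // next sequence number we expect to see
    dropped: u64,      // running total of detected drops
}

impl GapDetector {
    fn new(first_seq: u64) -> Self {
        Self { expected_seq: first_seq, dropped: 0 }
    }

    /// Feed each received entry's sequence number (assumed monotonic);
    /// returns the number of entries dropped immediately before this one.
    fn observe(&mut self, seq: u64) -> u64 {
        let gap = seq - self.expected_seq; // K-1 when seq jumps by K
        self.dropped += gap;
        self.expected_seq = seq + 1;
        gap
    }
}
```

On a nonzero return the real consumer would log the diagnostic event with ring identity and gap size, as described above.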
Summary of context rules:
| Producer context | Permitted API | On ring full | Notes |
|---|---|---|---|
| Thread context (IRQs enabled) | mpsc_produce_blocking() or mpsc_try_produce() | Blocking: spin + yield until space. Try: return Err(RingFull). | Blocking variant is the normal path for thread-context producers. |
| IRQ handler / softirq | mpsc_try_produce() ONLY | Return Err(RingFull), increment dropped_count, drop message. | Calling the blocking variant is a BUG (debug panic / release fallback). |
| NMI / MCE handler | NEITHER — use per-CPU buffer | N/A | See NMI/MCE safety below. |
NMI/MCE safety: NMI handlers and Machine Check Exception (MCE) handlers MUST NOT produce to MPSC rings. Mitigation 1 (disabling interrupts) does NOT protect against NMIs or MCEs — both are non-maskable architectural exceptions that fire regardless of the interrupt flag state. If an NMI or MCE handler needs to log data, it must use a dedicated per-CPU single-producer buffer (not shared with normal interrupt context) that is drained by the main kernel after the exception returns. On x86, MCE handlers additionally run on a dedicated IST (Interrupt Stack Table) stack, so they must not access per-CPU data structures that assume the normal kernel stack.
Broadcast channels (SPMC): isle-core -> all drivers. Used for system-wide notifications
(suspend imminent, memory pressure, clock change). Isle-core writes once; each driver
reads independently. The broadcast channel uses a sequence-numbered ring with a single
sequencing mechanism: the last_enqueued_seq field (hereafter write_seq in broadcast
mode), a u64 in the ring header. write_seq increments by 2 for each published entry
(odd values indicate a write in progress; even values indicate a stable, readable entry —
see torn-read prevention below). The logical entry count is write_seq / 2. The
DomainRingBuffer's published field is not used independently in broadcast mode;
if read, it is derived as write_seq / 2 for compatibility with diagnostic code that
inspects published. Implementations must not increment published separately from
write_seq in broadcast mode — write_seq is the sole source of truth.
Each consumer tracks its own read position (a u64 sequence number stored in the
consumer's private memory, not in the shared ring header). To read, a consumer scans
from its last-seen sequence to the ring's current write_seq (even values only). The
ring's tail field is unused in broadcast mode — the producer never needs to know
individual consumer positions. Instead, the producer overwrites the oldest entry when
the ring is full (broadcast semantics: slow consumers miss entries rather than blocking
the producer). Consumers detect missed entries by checking for sequence gaps.
Torn-read prevention: Each broadcast ring entry is bracketed by a u64 sequence
stamp. Layout: [seq_start: u64 | payload: [u8; entry_size - 16] | seq_end: u64]. The
producer writes seq_start = write_seq | 1 (odd = write in progress), then the payload,
then seq_end = write_seq (even = complete), then advances write_seq by 2. The
consumer reads seq_start, copies the payload, reads seq_end. If
seq_end != (seq_start ^ 1), the read is torn — seq_start and seq_end are not a
matched pair from the same write (a concurrent write changes seq_start to a different
odd value, causing this check to fail). Additionally, if seq_start < consumer.last_seq,
the entry is stale. In either case, the consumer detects the gap, increments gap_count,
and advances to the next entry. All sequence accesses use Ordering::Acquire (reads)
and Ordering::Release (writes).
/// Per-consumer broadcast state (stored in consumer's private memory).
pub struct BroadcastConsumer {
/// Last sequence number consumed by this consumer.
pub last_seq: u64,
}
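The bracketing protocol can be sketched for a single slot. `BroadcastSlot` is illustrative: the real ring has many entries sharing one `write_seq`, and the payload is a byte region rather than the single `AtomicU64` used here as a stand-in:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One broadcast ring slot: [seq_start | payload | seq_end], with the
/// payload modeled as a single u64 for brevity.
struct BroadcastSlot {
    seq_start: AtomicU64,
    payload: AtomicU64,
    seq_end: AtomicU64,
}

impl BroadcastSlot {
    fn new() -> Self {
        Self {
            seq_start: AtomicU64::new(0),
            payload: AtomicU64::new(0),
            seq_end: AtomicU64::new(0),
        }
    }

    /// Producer: stamp odd (write in progress), write payload, stamp even.
    /// `write_seq` is the ring's even write sequence for this entry.
    fn write(&self, write_seq: u64, value: u64) {
        self.seq_start.store(write_seq | 1, Ordering::Release); // odd
        self.payload.store(value, Ordering::Release);
        self.seq_end.store(write_seq, Ordering::Release);       // even
    }

    /// Consumer: None means a torn read -- the stamps are not a matched
    /// pair, so the caller skips the entry and counts a gap.
    fn read(&self) -> Option<u64> {
        let start = self.seq_start.load(Ordering::Acquire);
        let value = self.payload.load(Ordering::Acquire);
        let end = self.seq_end.load(Ordering::Acquire);
        if end != (start ^ 1) {
            return None; // torn or mid-write
        }
        Some(value)
    }
}
```

Note how a concurrent write is caught: the new write stamps `seq_start` to a different odd value before touching `seq_end`, so `end != start ^ 1` until both stamps match again.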
Capability passing. Capabilities (Section 11) can be transferred over any IPC channel.
The sending domain writes a CapabilityHandle (an opaque 64-bit token) into a ring buffer
entry. Isle-core intercepts the transfer at the domain boundary and validates the capability:
does the sender actually hold this capability? Is the capability transferable? Is the
receiver permitted to hold capabilities of this type? If validation passes, isle-core
translates the handle into the receiving domain's capability space -- the receiver gets a
new handle that maps to the same underlying resource but exists in its own namespace. Raw
capability data (kernel pointers, permission bitmasks) never crosses domain boundaries;
only validated, translated handles do.
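The validate-and-translate step can be sketched as follows. `Domain`, `ResourceId`, and the per-domain handle maps are illustrative stand-ins for the real capability tables of Section 11; the three validation questions mirror the prose above:

```rust
use std::collections::HashMap;

type Handle = u64;     // opaque per-domain token
type ResourceId = u64; // stand-in for the underlying kernel resource

/// Illustrative per-domain capability table: handle -> (resource, transferable).
struct Domain {
    caps: HashMap<Handle, (ResourceId, bool)>,
    next_handle: Handle,
}

impl Domain {
    fn new() -> Self {
        Self { caps: HashMap::new(), next_handle: 1 }
    }
}

/// isle-core's boundary check: validate the sender's claim, then mint a
/// fresh handle in the receiver's own namespace. Raw capability data never
/// crosses the boundary -- only the new opaque handle does.
fn transfer(sender: &mut Domain, receiver: &mut Domain, h: Handle) -> Option<Handle> {
    // Does the sender actually hold this capability?
    let &(resource, transferable) = sender.caps.get(&h)?;
    // Is the capability transferable?
    if !transferable {
        return None;
    }
    // Translate into the receiver's capability space.
    let new_h = receiver.next_handle;
    receiver.next_handle += 1;
    receiver.caps.insert(new_h, (resource, transferable));
    Some(new_h)
}
```

(A receiver-side type-policy check, the third validation question, is omitted from the sketch.)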
8a.4 Flow Control and Ordering
Ordering within a channel. Ring buffer entries are processed in strict FIFO order within a single channel. If isle-core submits commands A, B, C to a driver's command ring, the driver sees them in A, B, C order. Completions flow back in the order the driver produces them (which may differ from submission order -- a driver may complete a fast read before a slow write).
No ordering across channels. There is no ordering guarantee between different channels.
Driver A's completion may arrive at isle-core before driver B's completion, regardless of
which command was submitted first. Applications that need cross-device ordering must
enforce it at the io_uring level (using IOSQE_IO_LINK or IOSQE_IO_DRAIN), which
isle-core translates into sequencing constraints on the domain command rings.
Channel depths. Each channel has a configurable entry count, set at ring creation time via the device registry (Section 7):
| Channel type | Default depth | Typical entry size | Notes |
|---|---|---|---|
| Command (SPSC) | 256 | 64 bytes | Matches NVMe SQ depth default |
| Completion (SPSC) | 1024 | 16 bytes | 4x command depth for batched completions |
| Event (MPSC) | 512 | 32 bytes | Shared across all drivers on this event loop |
| Broadcast (SPMC) | 64 | 32 bytes | Low-frequency system events |
The minimum useful broadcast entry size is 24 bytes (8 bytes payload with 16 bytes of
sequence stamps for torn-read prevention). The default of 32 bytes provides 16 bytes of
payload, suitable for most event notifications. Isle-core rejects broadcast ring creation
requests with entry_size < 24.
Depths are tunable per-driver via the device registry's ring_config property. Drivers
that handle high-throughput workloads (NVMe, high-speed NIC) typically increase command
depth to 1024 or 4096 to match hardware queue depths.
Priority channels. Real-time I/O (Section 16) uses a separate high-priority command ring per driver. The driver polls the priority ring before the normal ring on every iteration. This ensures RT I/O is not head-of-line blocked behind bulk I/O. Priority rings use the same SPSC structure but are typically shallow (32-64 entries) since RT workloads are low-volume, latency-sensitive flows.
isle-core dispatch logic (per driver, per poll iteration):
1. Check priority command ring -> process all pending entries
2. Check normal command ring -> process up to batch_limit entries
3. Check event ring (MPSC) -> process system events
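The three-step dispatch can be sketched directly. `poll_driver` and the `Vec`-backed rings are illustrative stand-ins for the real DomainRingBuffers:

```rust
/// One poll iteration for one driver: priority ring fully, normal ring up
/// to batch_limit, then the shared event ring.
fn poll_driver<E>(
    priority: &mut Vec<E>,
    normal: &mut Vec<E>,
    events: &mut Vec<E>,
    batch_limit: usize,
    mut process: impl FnMut(E),
) {
    // 1. Priority ring first: RT I/O is never head-of-line blocked.
    for e in priority.drain(..) {
        process(e);
    }
    // 2. Normal ring: bounded batch so one driver cannot starve the loop.
    let n = normal.len().min(batch_limit);
    for e in normal.drain(..n) {
        process(e);
    }
    // 3. Event ring (MPSC): system events last.
    for e in events.drain(..) {
        process(e);
    }
}
```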
Comparison with Linux. Linux has no equivalent to the intra-kernel domain ring buffer.
Subsystem communication within the Linux kernel uses direct function calls with no
isolation boundary. The closest analogy is Linux's io_uring internal implementation
(the SQ/CQ ring structure), but that serves a different purpose (kernel-to-userspace
communication). ISLE effectively uses an io_uring-inspired ring structure inside the
kernel to connect isolated subsystems that Linux connects via unprotected function calls.
8a.5 Terminology Reference
The following terms are used precisely throughout this document. This reference resolves ambiguity that arises from the word "ring" appearing in multiple contexts:
| Term | Meaning | Where used |
|---|---|---|
| io_uring | Linux-compatible userspace async I/O interface. SQ/CQ rings mapped into user space. | Section 20.5, user-facing I/O API |
| domain ring buffer | Internal kernel IPC mechanism between isolation domains. SPSC or MPSC lock-free rings in shared memory. | Sections 5-9, driver architecture |
| MPSC ring | A domain ring buffer variant with CAS-based multi-producer support. Used for event aggregation. | Section 8a.3, event channels |
| Hardware queue | Device-specific command/completion queues (e.g., NVMe SQ/CQ, virtio virtqueue). Mapped via MMIO. | Section 8, device I/O paths |
| SPSC | Single-Producer Single-Consumer. The default domain ring buffer mode. | Section 8a.2 |
| SPMC | Single-Producer Multi-Consumer. Used for broadcast channels (isle-core -> all drivers). | Section 8a.3 |
Any unqualified reference to "ring buffer" in the driver architecture sections (Sections 5-9) means a domain ring buffer. Any reference to "io_uring" means the userspace interface. Hardware queues are always qualified by device type (e.g., "NVMe submission queue", "virtio virtqueue").
9. Crash Recovery and State Preservation
This is ISLE's killer feature -- the primary reason to choose it over Linux.
Scope: This section covers Tier 1 and Tier 2 driver crash recovery where the host kernel acts as supervisor. For peer kernel crash recovery (devices running ISLE as a first-class multikernel peer), see Section 47.2.5, which uses a different isolation model (IOMMU hard boundary + PCIe unilateral controls rather than software domain supervision).
The Linux Problem
In Linux, all drivers run in the same address space with no isolation. A single bug in any driver -- null pointer dereference, buffer overflow, use-after-free -- triggers a kernel panic. Recovery requires a full system reboot: 30-60 seconds of downtime, loss of all in-flight state, and potential filesystem corruption if writes were in progress.
ISLE Tier 1 Recovery Sequence
When a Tier 1 (domain-isolated) driver faults:
1. FAULT DETECTED
- Hardware exception (page fault, GPF) within a Tier 1 isolation domain
- OR watchdog timer expires (driver stalled for >Nms)
- OR driver returns invalid result / corrupts its ring buffer
2. ISOLATE
- ISLE Core revokes the faulting driver's isolation domain by setting
the access-disable bit for that domain's key in the domain register
(x86: set AD bit in PKRU; AArch64: clear overlay permissions in POR_EL0;
ARMv7: set domain to "No Access" in DACR; PPC32: invalidate segment;
PPC64LE: switch to revoked PID; RISC-V/fallback: unmap driver pages)
- Driver can no longer access any memory in its domain
- Interrupt lines for this driver are masked
3. DRAIN PENDING I/O
- All pending requests from user space are completed with -EIO
- Applications receive error codes, not crashes
- io_uring CQEs are posted with error status
4. DEVICE RESET
- Issue Function-Level Reset (FLR) via PCIe
- OR vendor-specific device reset sequence
- Device returns to known-good state
5. UNLOAD DRIVER
- Free all driver-private memory
- Release all driver capabilities
- Unmap driver MMIO regions
6. RELOAD DRIVER
- Load fresh copy of driver binary
- New bilateral vtable exchange
- Device re-initialization
- Re-register interrupt handlers
7. RESUME
- New driver begins accepting I/O requests
- Applications retry failed operations (standard I/O error handling)
TOTAL RECOVERY TIME: ~50ms typical (soft-reset path) to ~150ms (FLR path)
Recovery timing breakdown — The ~50ms figure applies to the soft-reset path where the driver performs a vendor-specific device reset (register write + status poll) without a full PCIe Function Level Reset. Many devices (Intel NICs, AHCI controllers) support fast software reset in 1-10ms. The full PCIe FLR path takes longer: the PCIe spec requires the function to complete FLR within 100ms (the device must not be accessed until FLR completes; software polls the device's configuration space to detect completion). With driver reload overhead, the FLR path totals ~150ms. ISLE prefers the soft-reset path when the driver crash was a software bug (the device hardware is fine); FLR is used when the device itself appears hung (no response to MMIO reads, completion timeout). In either case, the recovery is 100-1000x faster than a full Linux reboot (30-60s).
ISLE Tier 2 Recovery Sequence
Tier 2 (user-space process) driver recovery is even simpler:
1. Driver process crashes (SIGSEGV, SIGABRT, etc.)
2. ISLE Core's driver supervisor detects process exit
3. REVOKE DEVICE ACCESS
- Mark the device as "in recovery" in the device registry, preventing
any new MMIO mappings or device access grants for this device.
- Revoke the driver's IOMMU entries (tear down the device's IOMMU
domain mappings). Any in-flight DMA that completes after this point
hits an IOMMU fault and is dropped.
- If the dying process's teardown has not yet completed MMIO unmapping
(page table entry removal + TLB shootdown), force-invalidate the
relevant page table entries. In practice, the process is already
exiting at step 2, so MMIO unmapping is a cleanup operation — the
device registry marking and IOMMU revocation are what actually
prevent further device access.
4. Pending I/O completed with -EIO
5. Supervisor restarts driver process
6. New process re-initializes device, resumes service
TOTAL RECOVERY TIME: ~10ms
Why Tier 2 is faster than Tier 1 -- Counter-intuitively, the "weaker" isolation tier recovers faster. The reason is that Tier 2 recovery skips the most expensive step in the Tier 1 sequence: no device FLR in the normal case. Tier 2 drivers have direct MMIO access to their device's BAR regions (for performance), but MMIO revocation (step 3 above) cuts off device access immediately. The IOMMU prevents any DMA initiated through those MMIO registers from reaching non-driver memory, so there is no DMA safety hazard even if the device has in-flight operations.
IOTLB coherence and DMA page lifetime -- A lightweight IOMMU invalidation (not a
full drain fence) suffices at step 3 because Tier 2 recovery defers freeing the
crashed driver's DMA pages rather than draining all in-flight DMA. After IOMMU entry
revocation, stale IOTLB entries may still allow in-flight DMA to complete to the old
physical addresses. If those pages were freed immediately, this would be a
use-after-free via hardware. Instead, the old DMA pages remain allocated (owned by the
kernel, not the dead process) until the replacement driver instance calls init() and
either reuses them (warm restart via the state buffer) or explicitly releases them back
to the allocator. By the time pages are actually freed, the IOTLB has long since been
flushed — either by the invalidation at step 3, by natural IOTLB eviction, or by the
new driver's own IOMMU setup. This makes the IOTLB coherence window moot without
requiring a synchronous drain fence.
If the device appears hung after the Tier 2 crash (the replacement driver's init()
detects an unresponsive device), the registry escalates to FLR, but this fallback is
rare. Tier 2 recovery is typically "revoke mappings, restart the process, reconnect to
the ring buffer" -- a ~10ms operation dominated by process creation and driver
init().
State Preservation and Checkpointing
Driver recovery (the Tier 1 and Tier 2 sequences above) restarts a new driver instance, but without state preservation the new instance starts cold — losing in-flight I/O, device configuration, and connection state. ISLE uses a Theseus-inspired state spill design to enable warm restarts.
State buffer — Each Tier 1 driver has an associated kernel-managed "state buffer" that resides outside the driver's isolation domain. The buffer is allocated by isle-core and mapped read-write into the driver's address space. On crash, the isolation domain is destroyed but the state buffer survives (it belongs to isle-core).
Driver Isolation Domain (destroyed on crash) isle-core (survives)
┌─────────────────────────┐ ┌──────────────────────┐
│ Driver code + heap │ checkpoint → │ State Buffer │
│ Internal caches │ ──────────→ │ ┌────────────────┐ │
│ (NOT preserved) │ │ │ Version: 3 │ │
│ │ │ │ DevCmdQueue[] │ │
│ │ │ │ RingBufPos │ │
│ │ │ │ ConnState[] │ │
│ │ │ │ HMAC Tag │ │
└─────────────────────────┘ │ └────────────────┘ │
└──────────────────────┘
State buffer format:
- Driver-defined structure (the driver author decides what to checkpoint).
- Versioned via KABI version field — the state buffer header includes a format version number so a newer driver binary can detect and handle (or reject) state from an older version.
- HMAC-SHA256 integrity tag — computed by isle-core using a per-driver key, verified before handing to the new driver instance. Corrupt or tampered buffers are discarded.

The HMAC key is generated by isle-core on the first load of a driver for a given DeviceHandle. The key is stored in the DeviceNode (Section 8 Device Registry) and persists across driver crash/reload cycles. The key is only discarded when the DeviceHandle is removed from the registry (device unplugged or explicitly deregistered). On reload, isle-core verifies the existing state buffer using the persisted key, then continues using the same key for the new driver instance. The driver writes state data, but only isle-core can produce valid integrity tags, preventing a buggy driver from poisoning the state buffer with corrupted data.

Note: Tier 1 drivers run in Ring 0, so a deliberately compromised driver (with arbitrary code execution) could read the HMAC key from isle-core memory by bypassing MPK via WRPKRU (Section 3, WRPKRU threat model). This is within the documented Tier 1 threat model — MPK provides crash containment, not exploitation prevention. The HMAC protects state buffer integrity against bugs (the common case), not against active exploitation (which requires Tier 2 for defense).
Checkpoint frequency:
- Configurable per-driver. Default: checkpoint after every I/O batch completion, or every 1ms, whichever comes first.
- Checkpoint is a memcpy from driver-local structures to the inactive state buffer slot (~1–4 KB typical) plus an atomic doorbell write. At 1ms intervals, the overhead is negligible.
Torn checkpoint protection (double buffering):
The driver cannot compute the HMAC (only isle-core can), so a driver crash mid-write would leave a torn (partially written) state buffer. To prevent this, the state buffer uses a double-buffering protocol:
- The state buffer contains two slots (A and B). At any time, one slot is active (the last successfully checkpointed state) and the other is inactive (the write target for the next checkpoint).
- The driver writes its checkpoint data to the inactive slot. When the write is complete, the driver signals isle-core by writing a completion flag to a shared doorbell — a single atomic write visible to isle-core.
- Isle-core, on observing the doorbell (polled during periodic work or on driver crash), computes HMAC-SHA256 over the completed slot and atomically swaps the active slot pointer.
- On crash recovery, isle-core verifies the active slot's HMAC. If valid, that state is used for the new driver instance. If invalid (corruption or incomplete swap), isle-core falls back to the previous active slot, which still holds the last known-good checkpoint.
- The double-buffer swap is an atomic pointer update. There is no race with driver writes because the driver only ever writes to the inactive slot.
- After ringing the doorbell, the inactive slot is considered "pending" -- the driver must not begin a new checkpoint until isle-core completes the swap and clears the doorbell flag. If the next 1 ms checkpoint interval arrives while a swap is still pending, the driver skips that checkpoint cycle. In practice, isle-core processes the doorbell within microseconds (HMAC-SHA256 on 4 KB takes ~0.5--1 us), so skipped checkpoints are rare.
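The two-slot discipline above can be modeled in a few lines of ordinary Rust. This is a hedged user-space sketch, not the isle-core API: the struct and method names are illustrative, and `toy_tag` is a stand-in checksum for the keyed HMAC-SHA256 that only isle-core can compute.

```rust
// Toy model of the double-buffered checkpoint protocol (illustrative names).

#[derive(Clone, Default)]
struct Slot {
    data: Vec<u8>,
    tag: u64, // stand-in for the HMAC-SHA256 tag computed by isle-core
}

struct StateBuffer {
    slots: [Slot; 2],
    active: usize,  // index of the last successfully checkpointed slot
    doorbell: bool, // set by the driver, cleared by isle-core after the swap
}

/// Stand-in for HMAC-SHA256 with a per-driver key (FNV-1a-style mixing).
fn toy_tag(data: &[u8]) -> u64 {
    data.iter()
        .fold(0xcbf2_9ce4_8422_2325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
}

impl StateBuffer {
    fn new() -> Self {
        StateBuffer { slots: [Slot::default(), Slot::default()], active: 0, doorbell: false }
    }

    /// Driver side: write a checkpoint into the inactive slot, then ring the
    /// doorbell. Returns false (checkpoint skipped) if a swap is still pending.
    fn driver_checkpoint(&mut self, data: &[u8]) -> bool {
        if self.doorbell {
            return false; // previous swap still pending: skip this cycle
        }
        let inactive = 1 - self.active;
        self.slots[inactive].data = data.to_vec();
        self.doorbell = true;
        true
    }

    /// isle-core side: on doorbell, tag the completed slot and swap.
    fn core_process_doorbell(&mut self) {
        if self.doorbell {
            let inactive = 1 - self.active;
            self.slots[inactive].tag = toy_tag(&self.slots[inactive].data);
            self.active = inactive; // an atomic pointer swap in the real design
            self.doorbell = false;
        }
    }

    /// Recovery: verify the active slot; fall back to the other slot if torn.
    fn recover(&self) -> Option<&[u8]> {
        for idx in [self.active, 1 - self.active] {
            let s = &self.slots[idx];
            if !s.data.is_empty() && s.tag == toy_tag(&s.data) {
                return Some(&s.data);
            }
        }
        None // both slots invalid: cold restart
    }
}
```

A crash after `driver_checkpoint` but before `core_process_doorbell` leaves a torn inactive slot with a stale tag, so `recover()` falls back to the last good checkpoint, exactly the fallback path described above.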
TOCTOU mitigation (verify-then-use atomicity):
The state buffer is mapped read-write into the driver's address space, which creates a potential Time-Of-Check-Time-Of-Use (TOCTOU) vulnerability: a compromised driver could modify the active slot after isle-core verifies the HMAC but before the new driver instance reads it. ISLE prevents this attack through the following mechanisms:
- Slot revocation on crash: When a driver crashes, isle-core immediately revokes the crashed driver's write access to both state buffer slots by unmapping the entire state buffer from the old isolation domain. This is step 2 of the recovery sequence (Section 9) — it happens before HMAC verification (step 4). After revocation, the crashed driver's code cannot execute and its page tables are destroyed, so there is no entity that can modify the buffer between verification and use.
- Copy-on-verify to kernel-private storage: After HMAC verification succeeds, isle-core copies the verified slot contents to a kernel-private buffer (not mapped into any driver's address space). The new driver instance receives a read-only snapshot of this copy, not a pointer to the original state buffer. This ensures that even if an attacker could somehow gain write access to the original buffer (which they cannot, per point 1), the verified data cannot be altered.
- New driver isolation: The new driver instance is created with a fresh isolation domain. The state buffer is not mapped into this new domain until after the new driver calls `init()` and signals that it has finished consuming the checkpoint data. During initialization, the driver reads from the kernel-private copy (provided via a read-only mapping or explicit copy to the driver's local heap). Only after `init()` returns successfully does isle-core map the state buffer (both slots) read-write into the new driver's address space for future checkpoints.
- Atomicity guarantee: The sequence — unmap from old domain, verify HMAC, copy to kernel-private storage, create new domain — is performed with preemption disabled on the recovery CPU. There is no window during which any user-space code (driver or otherwise) can execute while holding write access to the verified buffer.
This design ensures that HMAC verification and data consumption are effectively atomic: once verified, the data cannot be modified by any entity before the new driver reads it. The cost is one additional memcpy (~4 KB) per recovery, which is negligible compared to the overall recovery latency (~50-150 ms).
HMAC-SHA256 performance:
- HMAC-SHA256 computation (~0.5–1 us for a 4 KB state buffer) is performed by isle-core asynchronously — not on the driver's hot path. The driver's checkpoint cost is limited to the memcpy plus an atomic doorbell write.
What is preserved vs. rebuilt:
| Preserved (in state buffer) | NOT preserved (rebuilt from scratch) |
|---|---|
| Device command queue positions | Driver-internal caches |
| Hardware register snapshots | Deferred work queues |
| In-flight I/O descriptors | Timers and timeout state |
| Ring buffer head/tail pointers | Debug/logging state |
| Connection/session state | Statistics counters (reset to zero) |
| Device configuration (MTU, features, etc.) | |
NVMe example:
- Checkpointed: submission queue tail doorbell position, completion queue head position, in-flight command IDs with their scatter-gather lists, namespace configuration.
- On reload: new driver reads state buffer, re-maps device BARs, verifies queue state against hardware registers, and resumes submission. In-flight commands that were submitted but not completed are re-issued.
NIC example:
- Checkpointed: active flow table entries, RSS (Receive Side Scaling) indirection table and hash key, interrupt coalescing settings, VLAN filter table, MAC address list.
- On reload: new driver re-programs the NIC with the checkpointed configuration. Active TCP connections see a brief pause (~50-150ms) but do not reset — the connection state lives in isle-net (Tier 1), not in the NIC driver.
Fallback:
- If HMAC verification of the state buffer fails, or the version is incompatible, the new driver instance performs a cold restart (current behavior: full device reset, all in-flight I/O returned as -EIO).
- Cold restart is always safe — state preservation is an optimization, not a requirement.
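The restore-or-cold-restart decision reduces to a small total function. A hedged sketch (the enum and function names are ours, not the isle-core API; the version rule assumes a newer driver may accept older state but never newer state):

```rust
/// Outcome of examining a crashed driver's state buffer (illustrative names).
#[derive(Debug, PartialEq)]
enum RecoveryMode {
    /// State buffer verified and version-compatible: warm restore.
    WarmRestore,
    /// Anything else: full device reset, in-flight I/O returned as -EIO.
    ColdRestart,
}

/// Decide the recovery mode from the HMAC check and version comparison.
fn choose_recovery(hmac_valid: bool, state_version: u32, driver_supports: u32) -> RecoveryMode {
    // A newer driver binary may handle older-format state; it must never
    // consume state newer than the format it understands.
    if hmac_valid && state_version <= driver_supports {
        RecoveryMode::WarmRestore
    } else {
        RecoveryMode::ColdRestart
    }
}
```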
Crash Dump Infrastructure
When isle-core itself faults (not a driver — the core kernel), the system needs to capture diagnostic state for post-mortem analysis. Unlike driver crashes (which are recoverable), a core panic is fatal.
Reserved memory region:
- At boot, ISLE reserves a contiguous physical memory region for crash dumps, configured
via boot parameter: isle.crashkernel=256M (similar to Linux crashkernel=).
- This region is excluded from the normal physical memory allocator — it survives a
warm reboot if the firmware doesn't clear RAM.
Panic sequence:
1. Core panic triggered (null deref, assertion failure, double fault, etc.)
2. Disable interrupts on all CPUs (IPI NMI broadcast)
3. Panic handler (Tier 0 code, always resident, minimal dependencies):
a. Save register state for the faulting CPU:
- x86-64: GPRs, CR3, IDTR, RSP, RFLAGS, RIP, segment selectors
- AArch64: GPRs (x0-x30), SP_EL1, ELR_EL1, SPSR_EL1, ESR_EL1, FAR_EL1
- ARMv7: GPRs (r0-r15), CPSR, DFAR, DFSR, IFAR, IFSR
- RISC-V: GPRs (x0-x31), sepc, scause, stval, sstatus, satp
b. Walk the stack, generate backtrace (using .eh_frame / DWARF unwind info)
c. Snapshot key data structures:
- Active process list + their states
- Capability table summary
- Driver registry state
- IRQ routing table
- Recent ring buffer entries (last 64KB of klog)
d. Write all of the above into the reserved crash region as an ELF core dump
4. Flush panic message to serial console (already works in current implementation)
5. If a pre-registered NVMe region exists (configured at boot):
a. Use the NVMe driver's Tier 0 "panic write" path (polled mode, no interrupts)
b. Write the crash dump from reserved memory to the NVMe region
6. Halt or reboot (configurable: `isle.panic=halt|reboot`, default: halt)
Crash stub:
- The panic handler is Tier 0 code: statically linked, no dynamic dispatch, no allocation, no locks (or only try-lock with immediate fallback). It must work even if the heap, scheduler, or interrupt subsystem is corrupted.
- Serial output always works (Tier 0 serial driver, polled mode).
- NVMe panic write uses polled I/O (no interrupts, no completion queues) — a simplified write path that can function with a partially-corrupted kernel.
Next boot recovery:
1. Bootloader loads ISLE kernel
2. Early init checks the reserved crash region for a valid dump header
3. If found:
a. Copy dump to a temporary in-memory buffer
b. After filesystem mount, write to /var/crash/isle-dump-<timestamp>.elf
c. Log "Previous crash dump saved to /var/crash/isle-dump-<timestamp>.elf"
d. Clear the reserved crash region
4. The dump can be analyzed with standard tools:
- `crash` utility (same as Linux kdump analysis)
- GDB with the ISLE kernel debug symbols
- `isle-crashdump` tool (ISLE-specific, extracts structured summaries)
Dump format:
- ELF core dump format, compatible with the crash utility and GDB.
- Contains: register state, memory regions (kernel text, data, stack pages for active
threads, page tables), and a note section with ISLE-specific metadata (kernel version,
boot parameters, uptime, driver state).
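The ISLE-specific metadata travels in a standard ELF note section, where the 4-byte padding rules are the usual stumbling block. A hedged sketch of building one note entry (the note name "ISLE" and type value are illustrative, not a committed format):

```rust
/// Build one ELF note entry. Layout per the ELF spec: u32 namesz, u32 descsz,
/// u32 type, then the NUL-terminated name and the descriptor, each padded to
/// a 4-byte boundary. The name/type used here are illustrative.
fn build_elf_note(name: &str, note_type: u32, desc: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let namesz = (name.len() + 1) as u32; // includes the trailing NUL
    out.extend_from_slice(&namesz.to_le_bytes());
    out.extend_from_slice(&(desc.len() as u32).to_le_bytes());
    out.extend_from_slice(&note_type.to_le_bytes());
    out.extend_from_slice(name.as_bytes());
    out.push(0); // NUL terminator
    while out.len() % 4 != 0 { out.push(0); } // pad name to 4-byte boundary
    out.extend_from_slice(desc);
    while out.len() % 4 != 0 { out.push(0); } // pad desc to 4-byte boundary
    out
}
```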
No kexec on day one:
- Linux uses kexec to boot a second "crash kernel" that writes the dump. This is reliable but complex.
- ISLE uses a simpler "in-place dump" to reserved memory: the panic handler writes directly to the reserved region without booting a second kernel.
- kexec-based crash dump is a future enhancement for systems where the in-place approach is insufficient (e.g., very large memory dumps requiring a full kernel to compress and transmit).
Recovery Comparison
| Scenario | Linux | ISLE |
|---|---|---|
| NVMe driver null deref | Kernel panic, full reboot | Reload driver, ~50-150ms |
| NIC driver infinite loop | System freeze | Watchdog kill, reload, ~50-150ms |
| USB driver buffer overflow | Kernel panic | Restart process, ~10ms |
| FS driver corruption | Kernel panic + fsck | Reload driver, fsck on mount |
| Audio driver crash | Kernel panic | Restart process, ~10ms |
Crash History and Auto-Demotion
The kernel tracks per-driver crash statistics:
crash_count[driver_id] within window (default: 1 hour)
0-2 crashes: Reload at same tier
3+ crashes: Demote to next lower tier (if minimum_tier allows)
Log warning, notify admin
5+ crashes: Transition to Quarantined (driver permanently disabled); manual re-enable via sysfs. Log critical alert
A Tier 1 driver that crashes 3 times is demoted to Tier 2 (full process isolation), accepting the performance penalty for increased safety. An administrator can override this policy.
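The thresholds above map directly onto a small policy function. A sketch with illustrative names (the real policy also consults `minimum_tier` and the admin override, which are omitted here):

```rust
/// Action taken after a driver crash, based on crashes within the
/// sliding window (default: 1 hour). Names are illustrative.
#[derive(Debug, PartialEq)]
enum CrashAction {
    /// 0-2 crashes: reload at the same tier.
    ReloadSameTier,
    /// 3-4 crashes: demote to the next lower tier (if minimum_tier allows).
    DemoteOneTier,
    /// 5+ crashes: quarantine until manual re-enable via sysfs.
    Quarantine,
}

fn crash_policy(crashes_in_window: u32) -> CrashAction {
    match crashes_in_window {
        0..=2 => CrashAction::ReloadSameTier,
        3..=4 => CrashAction::DemoteOneTier,
        _ => CrashAction::Quarantine,
    }
}
```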
See also: Section 52 (Live Kernel Evolution) extends crash recovery to proactively replace core kernel components at runtime, reusing the same state-export/reload mechanism. Section 39 (Fault Management) adds predictive telemetry and diagnosis before crashes occur.
10. USB Class Drivers and Mass Storage
USB devices follow a class-based driver model. The USB host controller driver (xHCI for USB 3.x, EHCI for USB 2.0) is a Tier 1 platform driver that manages host controller hardware and the root hub. Class drivers are layered above it and bind to devices by USB class code, subclass, and protocol — not by vendor/product ID — giving a single driver coverage across all standards-compliant devices of a class.
10.1 USB Host Controller (xHCI, Tier 1)
The xHCI driver (USB 3.2 specification) manages:
- Transfer ring management: each endpoint has a ring buffer (producer/consumer pointers in memory). The driver enqueues Transfer Request Blocks (TRBs); the controller processes them and posts Transfer Event TRBs to the Event Ring.
- Command ring: host-issued commands (Enable Slot, Disable Slot, Configure Endpoint, Reset Device) use a separate command ring.
- Interrupt moderation: MSI-X per-interrupter; Event Ring Segment Table (ERST) maps event ring memory to the controller.
Device enumeration: root hub port status change → enumerate device at default
address 0 → GET_DESCRIPTOR (device, configuration, interface, endpoint) →
assign address via SET_ADDRESS → bind class driver based on bDeviceClass or
bInterfaceClass.
10.2 USB Mass Storage (UMS) and USB Attached SCSI (UAS)
Both protocols expose USB storage devices as block devices to isle-block.
UMS (USB Mass Storage, Bulk-Only Transport):
- Wraps SCSI commands in a Command Block Wrapper (CBW) sent over a bulk-out
endpoint; device responds with data and a Command Status Wrapper (CSW) on
bulk-in. One outstanding command at a time.
- Device registers as BlockDevice with isle-block upon successful SCSI
INQUIRY → READ CAPACITY(16) sequence.
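For reference, the CBW is a fixed 31-byte little-endian structure (per the USB Mass Storage Bulk-Only Transport 1.0 specification). A sketch of its serialization; the helper name is ours, not an ISLE API:

```rust
/// Serialize a Bulk-Only Transport Command Block Wrapper (31 bytes).
/// Field layout per BOT 1.0:
///   [0..4)   dCBWSignature = 0x43425355 ("USBC" on the wire)
///   [4..8)   dCBWTag (echoed back in the CSW for matching)
///   [8..12)  dCBWDataTransferLength
///   [12]     bmCBWFlags (bit 7: 1 = data-in, device to host)
///   [13]     bCBWLUN
///   [14]     bCBWCBLength (SCSI command length, 1..=16)
///   [15..31) CBWCB: the SCSI command block itself
fn build_cbw(tag: u32, transfer_len: u32, data_in: bool, lun: u8, scsi_cmd: &[u8]) -> [u8; 31] {
    assert!(!scsi_cmd.is_empty() && scsi_cmd.len() <= 16);
    let mut out = [0u8; 31];
    out[0..4].copy_from_slice(&0x43425355u32.to_le_bytes()); // "USBC"
    out[4..8].copy_from_slice(&tag.to_le_bytes());
    out[8..12].copy_from_slice(&transfer_len.to_le_bytes());
    out[12] = if data_in { 0x80 } else { 0x00 };
    out[13] = lun & 0x0F;
    out[14] = scsi_cmd.len() as u8;
    out[15..15 + scsi_cmd.len()].copy_from_slice(scsi_cmd);
    out
}
```

The device's 13-byte Command Status Wrapper carries the matching tag and a residue count, which is how the one-command-at-a-time discipline is enforced.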
UAS (USB Attached SCSI, USB 3.0+):
- Four-endpoint protocol (command, status, data-in, data-out). Multiple
outstanding commands (up to 65535 via stream IDs). Significantly higher
throughput and lower latency than UMS for fast SSDs.
- Preferred over UMS when both are supported (bInterfaceProtocol = 0x62).
- Same BlockDevice registration as UMS; isle-block sees no difference.
Hotplug: USB device removal triggers an Unregister event in the device
registry (§7). The volume layer (§29) transitions dependent block devices to
DEVICE_FAILED state. Auto-mount/unmount policy is handled by a userspace
daemon (udev-compatible via isle-compat) reacting to device registry events.
Tier classification: UMS/UAS drivers are Tier 2 — they communicate over USB (inherently higher latency than PCIe), and the attack surface of USB storage firmware justifies full process isolation over the modest CPU overhead.
10.3 USB4 and Thunderbolt
USB4 (based on Thunderbolt 3 protocol) and Thunderbolt 3/4 are high-bandwidth interconnects (40 Gbps) that tunnel multiple protocols — PCIe, DisplayPort, USB — over a single cable. They are relevant across server (external NVMe enclosures, 40GbE NICs), workstation (external GPUs), and embedded (dock stations) contexts.
Architecture: A USB4/Thunderbolt port is controlled by a retimer/router chip with its own firmware. The host-side driver configures the router and establishes tunnels. The tunneled protocols then appear as native devices:
Physical cable (USB4/TB4)
└── USB4 router (host controller + retimer firmware)
├── PCIe tunnel → appears as PCIe device (NVMe, GPU, NIC)
├── DisplayPort tunnel → appears as DP connector (§62.3, `15-user-io.md`)
└── USB tunnel → appears as USB hub → USB class devices
Kernel responsibilities:
- Router enumeration: Discover USB4 routers via their management interface (MMIO registers or USB control endpoint). Read the router topology descriptor to find upstream/downstream adapters and their capabilities.
- IOMMU enforcement (mandatory for PCIe tunnels): Before establishing a PCIe tunnel to an external device, the kernel allocates an IOMMU domain for the tunneled device. The PCIe device behind the tunnel is treated identically to a native PCIe device — it gets its own IOMMU domain, its own device registry entry, and its driver follows the normal Tier 1/2 model. IOMMU protection is not optional; external PCIe devices are untrusted by definition.
- Tunnel authorization: The kernel blocks PCIe tunnel establishment until an authorization signal is received via sysfs: `/sys/bus/thunderbolt/devices/<device>/authorized`. Writing `1` authorizes the device; writing `0` de-authorizes and tears down the tunnel. This is the kernel's policy interface — what triggers the write (user prompt, pre-approved list, automatic trust) is userspace policy.
- Hotplug lifecycle:
  - Connect: router detects device → kernel enumerates → IOMMU domain allocated → authorization check → tunnel established → PCIe/DP/USB device appears
  - Disconnect: router reports link-down → kernel tears down tunnel → IOMMU domain revoked → device registry `Unregister` event → volume/display/USB layers handle disappearance gracefully
/// USB4/Thunderbolt router state.
pub struct Usb4Router {
/// Router hardware generation and capabilities.
pub gen: Usb4Generation,
/// Upstream adapter (host-facing port).
pub upstream: Usb4Adapter,
/// Downstream adapters (device-facing ports).
pub downstream: Vec<Usb4Adapter>,
/// Currently active tunnels.
pub tunnels: Vec<Usb4Tunnel>,
/// IOMMU domains for active PCIe tunnels.
pub pcie_domains: BTreeMap<Usb4AdapterId, IommuDomain>,
}
#[repr(u32)]
pub enum Usb4Generation {
Usb4Gen2 = 2, // 20 Gbps
Usb4Gen3 = 3, // 40 Gbps
Tb3 = 30, // Thunderbolt 3 (40 Gbps)
Tb4 = 40, // Thunderbolt 4 (40 Gbps, mandatory PCIe + DP)
}
pub struct Usb4Tunnel {
pub kind: Usb4TunnelKind,
pub adapter_id: Usb4AdapterId,
pub iommu_domain: Option<IommuDomain>, // Some for PCIe tunnels
}
#[repr(u32)]
pub enum Usb4TunnelKind {
Pcie = 0,
DisplayPort = 1,
Usb3 = 2,
}
Firmware updates for TB controllers: Controller firmware is updatable via
the NVM update protocol (vendor-specific, typically via the thunderbolt sysfs
interface). The kernel exposes the firmware version and provides a write interface
for firmware blobs. Actual firmware image selection and update policy is userspace.
Relationship to §47.2.5: External PCIe devices attached via USB4/Thunderbolt use the same IOMMU hard boundary and unilateral controls (bus master disable, FLR, slot power) as internal PCIe devices. If the external device runs an ISLE peer kernel (§47.2.2), it participates in the cluster exactly as an internal device would — the tunnel is transparent to the cluster protocol.
11. I2C/SMBus Bus Framework
I2C (Inter-Integrated Circuit) and SMBus (System Management Bus, a subset of I2C) are low-speed serial buses used throughout the hardware stack — in servers as well as consumer and embedded devices:
Server / datacenter uses:
- BMC (Baseboard Management Controller) sensor buses: CPU, DIMM, and VRM temperature sensors; fan speed controllers; PSU monitoring
- PMBus (Power Management Bus, layered on SMBus): voltage regulators, power sequencing, power rail telemetry
- SPD (Serial Presence Detect): JEDEC EEPROM on each DIMM, read at boot for memory training; JEDEC JEP106 manufacturer ID, capacity, speed grade, thermal sensor register on DDR4/5 DIMMs
- IPMI satellite controllers (IPMB — IPMI over I2C)
Consumer / embedded uses:
- Touchpads and touchscreens (I2C-HID protocol, §11.3 below)
- Audio codecs (I2C control path for volume, routing, power state)
- Ambient light sensors, accelerometers (shock/vibration detection)
- Battery and charger controllers (Smart Battery System over SMBus)
11.1 I2C Bus Trait
Platform I2C controller drivers (Intel LPSS, AMD FCH, Synopsys DesignWare,
Broadcom BCM2835, Aspeed AST2600 BMC) implement the I2cBus trait. The trait
is in isle-core/src/bus/i2c.rs.
/// I2C device address (7-bit, right-aligned; 0x00–0x7F).
pub type I2cAddr = u8;
/// I2C transfer result.
#[repr(u32)]
pub enum I2cResult {
Ok = 0,
/// No ACK (device not present or not responding).
NoAck = 1,
/// Bus arbitration lost (multi-master collision).
ArbitrationLost = 2,
/// Timeout (clock stretching exceeded or device hung).
Timeout = 3,
InvalidParam = 4,
}
/// I2C bus trait. Implemented by platform-specific controller drivers.
pub trait I2cBus: Send + Sync {
/// Combined write-then-read (I2C repeated START).
/// Typical pattern: write register address, read value.
fn transfer(&self, addr: I2cAddr, write: &[u8], read: &mut [u8]) -> I2cResult;
fn write(&self, addr: I2cAddr, data: &[u8]) -> I2cResult {
self.transfer(addr, data, &mut [])
}
fn read(&self, addr: I2cAddr, buf: &mut [u8]) -> I2cResult {
self.transfer(addr, &[], buf)
}
}
/// Handle to a device at a fixed address on a specific I2C bus.
pub struct I2cDevice {
pub bus: Arc<dyn I2cBus>,
pub addr: I2cAddr,
}
impl I2cDevice {
pub fn read_reg(&self, reg: u8) -> Result<u8, I2cResult> {
let mut buf = [0u8];
match self.bus.transfer(self.addr, &[reg], &mut buf) {
I2cResult::Ok => Ok(buf[0]),
e => Err(e),
}
}
pub fn write_reg(&self, reg: u8, val: u8) -> I2cResult {
self.bus.write(self.addr, &[reg, val])
}
/// Read a 16-bit little-endian register (common on SMBus devices).
    pub fn read_reg16_le(&self, reg: u8) -> Result<u16, I2cResult> {
        let mut buf = [0u8; 2];
        // `transfer` returns a bare I2cResult (not a Result), so `?` does not
        // apply; match explicitly, as read_reg does.
        match self.bus.transfer(self.addr, &[reg], &mut buf) {
            I2cResult::Ok => Ok(u16::from_le_bytes(buf)),
            e => Err(e),
        }
    }
}
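The write-register-pointer-then-read idiom that `I2cDevice` relies on is easy to exercise against an in-memory bus. A self-contained sketch (the `I2cResult`/`I2cBus` definitions are abridged copies of the contract above, and the mock register file is purely illustrative):

```rust
use std::sync::Mutex;

/// Abridged I2C result and bus trait, for a self-contained example.
#[derive(Debug, PartialEq)]
enum I2cResult { Ok, NoAck }

trait I2cBus: Send + Sync {
    /// Combined write-then-read (I2C repeated START).
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult;
}

/// Mock device: a 256-byte register file at a fixed address, honoring the
/// usual "byte 0 of the write is the register pointer" convention.
struct MockRegFile {
    addr: u8,
    regs: Mutex<[u8; 256]>,
}

impl I2cBus for MockRegFile {
    fn transfer(&self, addr: u8, write: &[u8], read: &mut [u8]) -> I2cResult {
        if addr != self.addr {
            return I2cResult::NoAck; // no device at this address
        }
        let mut regs = self.regs.lock().unwrap();
        let ptr = write.first().copied().unwrap_or(0) as usize;
        // Register write: byte 0 is the register pointer, the rest is data.
        for (i, &b) in write.iter().skip(1).enumerate() {
            regs[(ptr + i) % 256] = b;
        }
        // Repeated-START read starting at the register pointer.
        for (i, slot) in read.iter_mut().enumerate() {
            *slot = regs[(ptr + i) % 256];
        }
        I2cResult::Ok
    }
}
```

Against this mock, a `write_reg`-style transfer of `[0x05, 0xAB]` followed by a `read_reg`-style transfer of `[0x05]` round-trips the value, and a wrong address yields `NoAck`.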
Tier classification: I2C controller drivers are Tier 1 — they are platform-integrated and accessed from multiple other Tier 1 drivers (audio, sensor, battery). Device drivers using I2C (touchpads, sensors) follow their own tier classification based on their function.
Device enumeration: I2C devices are enumerated from ACPI (_HID, _CRS
with I2cSerialBusV2 resource) or device-tree compatible strings. The bus
manager matches each ACPI/DT node to a registered I2C device driver.
11.2 SMBus and Hardware Sensors
SMBus restricts I2C to well-defined transaction types (Quick Command, Send Byte,
Read Byte, Read Word, Block Read) and adds a PEC (Packet Error Code) byte for
data integrity. The ISLE SMBus layer wraps I2cBus and enforces SMBus
transaction semantics.
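The PEC byte is a CRC-8 (polynomial x^8 + x^2 + x + 1, i.e. 0x07, initial value 0, no final XOR) computed over the entire transaction including the address/direction bytes. A bitwise sketch (the function name is ours):

```rust
/// SMBus PEC: CRC-8 with polynomial 0x07, init 0, no final XOR,
/// computed over every byte on the wire including address bytes.
fn smbus_pec(bytes: &[u8]) -> u8 {
    let mut crc: u8 = 0;
    for &b in bytes {
        crc ^= b;
        for _ in 0..8 {
            crc = if crc & 0x80 != 0 { (crc << 1) ^ 0x07 } else { crc << 1 };
        }
    }
    crc
}
```

A receiver can validate by running the CRC over the message plus the received PEC byte: for this init-0, no-XOR CRC, a correct PEC leaves a zero residue.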
11.2.1 Hardware Monitoring (hwmon) Interface
Server and workstation motherboards expose dozens of sensors over I2C/SMBus.
ISLE provides a HwmonDevice trait analogous to Linux's hwmon subsystem:
/// A hardware monitor device (temperature, voltage, fan, current sensors).
pub trait HwmonDevice: Send + Sync {
/// Device name (e.g., "nct6779", "ina3221", "max31790").
fn name(&self) -> &str;
/// Read a temperature sensor in millidegrees Celsius.
/// Returns None if the sensor index is not present.
fn temperature_mc(&self, index: u8) -> Option<i32>;
/// Read a fan speed in RPM.
fn fan_rpm(&self, index: u8) -> Option<u32>;
/// Read a voltage in millivolts.
fn voltage_mv(&self, index: u8) -> Option<i32>;
/// Read a current in milliamperes.
fn current_ma(&self, index: u8) -> Option<i32>;
/// Set a fan PWM duty cycle (0–255).
fn set_fan_pwm(&self, index: u8, pwm: u8) -> Result<(), I2cResult>;
}
Registered HwmonDevice instances are exposed via sysfs under
/sys/class/hwmon/hwmon<N>/. Userspace daemons (fancontrol, lm-sensors,
IPMI daemons, monitoring agents like Prometheus node-exporter) read these
paths without kernel modifications. ISLE's hwmon sysfs layout is compatible
with Linux's hwmon ABI.
11.2.2 PMBus (Power Management Bus)
PMBus is a layered protocol over SMBus for communicating with power conversion devices (VRMs, PSUs, battery chargers). PMBus defines a standardised command set (PMBUS_READ_VIN, PMBUS_READ_VOUT, PMBUS_READ_IOUT, PMBUS_READ_TEMPERATURE_1, etc.) with standardised data formats.
The ISLE PMBus driver:
1. Probes devices via ACPI/DT with pmbus compatible string.
2. Reads PMBUS_MFR_ID, PMBUS_MFR_MODEL for identification.
3. Registers a HwmonDevice exposing all PMBus telemetry channels.
4. Monitors STATUS_WORD for fault conditions (over-voltage, over-current,
over-temperature, fan fault) and posts HwmonFaultEvent to the event
subsystem (§16b, 02-kernel-core.md) so userspace daemons can react.
11.2.3 DIMM SPD and Thermal Sensors
DDR4/DDR5 DIMMs have an SPD EEPROM at I2C address 0x50–0x57 (slot-indexed). The memory controller driver reads SPD at boot for training parameters. DDR4 DIMMs also expose a thermal sensor at address 0x18–0x1F, following the JEDEC JC-42.4 (TSE2004-compatible) register interface.
/// SPD EEPROM read (partial — first 256 bytes sufficient for JEDEC training).
pub fn read_spd(bus: &dyn I2cBus, slot: u8) -> Result<[u8; 256], I2cResult> {
    let addr = 0x50u8 | (slot & 0x07);
    let mut buf = [0u8; 256];
    // SPD page select not needed for the first 256 bytes on DDR4.
    // `transfer` returns a bare I2cResult, so match explicitly (no `?`).
    match bus.transfer(addr, &[0x00], &mut buf) {
        I2cResult::Ok => Ok(buf),
        e => Err(e),
    }
}
/// DDR4 thermal sensor read (TS register 0x05, 13-bit two's complement, 0.0625°C LSB).
pub fn read_dimm_temp_mc(bus: &dyn I2cBus, slot: u8) -> Result<i32, I2cResult> {
    let addr = 0x18u8 | (slot & 0x07);
    let mut buf = [0u8; 2];
    match bus.transfer(addr, &[0x05], &mut buf) {
        I2cResult::Ok => {}
        e => return Err(e),
    }
    // JC-42.4-class sensors return the TS register MSB first.
    let raw = u16::from_be_bytes(buf);
    // Bits [15:13] are status flags; bits [12:0] are the temperature in
    // 1/16°C units, with bit 12 as the sign. Shift left 3 to drop the
    // flags, then arithmetic-shift right 3 to sign-extend.
    let temp_raw = (((raw << 3) as i16) >> 3) as i32;
    Ok(temp_raw * 625 / 10) // convert 1/16°C units to millidegrees Celsius
}
11.3 I2C-HID Protocol
I2C-HID (HID over I2C, HIDI2C v1.0 specification) is used for touchpads, touchscreens, fingerprint readers, and other HID devices with I2C interfaces. The kernel implements the transport layer; HID report parsing is shared with the USB HID stack (§10.1).
Protocol flow:
1. ACPI reports device with PNP0C50 (_HID) or ACPI0C50; _CRS provides
I2C address, IRQ GPIO line, and descriptor register address.
2. Driver reads HID descriptor (30 bytes) from the descriptor register.
3. Driver reads HID Report Descriptor and passes it to the shared HidParser.
4. Device asserts IRQ GPIO (falling edge) when a new input report is ready.
5. ISR: reads input report from the input register address specified in
descriptor; parses via HidParser; posts InputEvent to the §15 input
ring buffer.
#[repr(C, packed)]
pub struct I2cHidDescriptor {
pub length: u16, // Must be 30
pub bcd_version: u16, // 0x0100 for v1.0
pub report_desc_len: u16,
pub report_desc_reg: u16,
pub input_reg: u16,
pub max_input_len: u16,
pub output_reg: u16,
pub max_output_len: u16,
pub cmd_reg: u16,
pub data_reg: u16,
pub vendor_id: u16,
pub product_id: u16,
pub version_id: u16,
_reserved: [u8; 4],
}
The full I2cHidDevice implementation and interrupt handler:
// isle-core/src/hid/i2c_hid.rs
/// I2C-HID driver state.
pub struct I2cHidDevice {
/// I2C device handle.
pub i2c: I2cDevice,
/// Descriptor (fetched at probe time).
pub desc: I2cHidDescriptor,
/// Interrupt GPIO line (from ACPI `_CRS` GpioInt resource).
pub irq_gpio: GpioLine,
/// HID report descriptor (fetched at probe time).
pub report_desc: Vec<u8>,
/// Parsed HID report parser state.
pub parser: HidParser,
}
impl I2cHidDevice {
    /// Probe an I2C-HID device. Called when ACPI reports `PNP0C50` (I2C-HID).
    pub fn probe(i2c: I2cDevice, irq_gpio: GpioLine) -> Result<Self, ProbeError> {
        // Read descriptor from register 0x0001 (register address sent LE).
        // `transfer` returns a bare I2cResult, so match explicitly; ProbeError
        // is assumed to have a variant wrapping I2cResult.
        let mut desc_buf = [0u8; 30];
        match i2c.bus.transfer(i2c.addr, &[0x01, 0x00], &mut desc_buf) {
            I2cResult::Ok => {}
            e => return Err(ProbeError::I2c(e)),
        }
        // SAFETY: I2cHidDescriptor is #[repr(C, packed)] with all u16/u8 fields,
        // matching the 30-byte wire format. read_unaligned is required because the
        // I2C transfer buffer may not be 2-byte aligned.
        let desc: I2cHidDescriptor =
            unsafe { core::ptr::read_unaligned(desc_buf.as_ptr() as *const _) };
        // Read HID report descriptor.
        let mut report_desc = vec![0u8; desc.report_desc_len as usize];
        let reg_bytes = desc.report_desc_reg.to_le_bytes();
        match i2c.bus.transfer(i2c.addr, &reg_bytes, &mut report_desc) {
            I2cResult::Ok => {}
            e => return Err(ProbeError::I2c(e)),
        }
        // Parse HID report descriptor to build parser.
        let parser = HidParser::parse(&report_desc)?;
        // Register interrupt handler. The ISR closure captures its own copies
        // (bus handle clone, parser clone, the two register fields it needs)
        // so the originals can move into `Self` below.
        let isr_i2c = I2cDevice { bus: i2c.bus.clone(), addr: i2c.addr };
        let isr_parser = parser.clone();
        let (input_reg, max_input_len) = (desc.input_reg, desc.max_input_len);
        irq_gpio.enable_interrupt(GpioInterruptMode::FallingEdge, move || {
            Self::handle_interrupt(&isr_i2c, input_reg, max_input_len, &isr_parser);
        })?;
        Ok(Self { i2c, desc, irq_gpio, report_desc, parser })
    }
    /// Interrupt handler: read HID report, parse, deliver events.
    fn handle_interrupt(i2c: &I2cDevice, input_reg: u16, max_input_len: u16, parser: &HidParser) {
        // Read input report from the input register.
        let mut report_buf = vec![0u8; max_input_len as usize];
        let reg_bytes = input_reg.to_le_bytes();
        match i2c.bus.transfer(i2c.addr, &reg_bytes, &mut report_buf) {
            I2cResult::Ok => {}
            _ => return, // Ignore read errors (spurious interrupt or device glitch).
        }
        // Parse HID report → InputEvent structs.
        let events = parser.parse_input_report(&report_buf);
        for event in events {
            isle_input::post_event(event); // Write to input subsystem ring buffer (§15).
        }
    }
}
11.4 Precision Touchpad (PTP)
Windows Precision Touchpad devices use HID Usage Page 0x0D (Digitizers), Usage 0x05 (Touch Pad). The HID report contains:
- Contact count: number of active touches (0-10+).
- Per-contact data: X/Y position (absolute, in logical units), contact width/height, pressure, contact ID.
- Button state: physical button click (if present), pad click (tap-to-click handled in userspace).
// isle-core/src/hid/touchpad.rs
/// Parsed Precision Touchpad report.
pub struct PtpReport {
/// Number of active contacts.
pub contact_count: u8,
/// Per-contact data (up to 10 simultaneous touches).
pub contacts: [PtpContact; 10],
/// Button state (bit 0 = left button, bit 1 = right button).
pub buttons: u8,
}
/// Single touch contact on a Precision Touchpad.
#[derive(Clone, Copy)]
pub struct PtpContact {
/// Contact ID (persistent across reports while finger is down).
pub id: u8,
/// Tip switch (1 = finger down, 0 = finger lifted).
pub tip: bool,
/// X position (logical units, 0 = left edge).
pub x: u16,
/// Y position (logical units, 0 = top edge).
pub y: u16,
/// Width (logical units, or 0 if not reported).
pub width: u16,
/// Height (logical units, or 0 if not reported).
pub height: u16,
}
Gesture recognition: Kernel delivers raw multi-touch HID reports via the input ring buffer. Gesture recognition (palm rejection, tap-to-click, multi-finger swipes) is handled by a userspace input library (libinput or equivalent).
12. Major Driver Subsystem Interfaces
Complex hardware categories — wireless networking, display, and audio — each require a shared kernel subsystem that multiple hardware-specific drivers plug into. This section defines the authoritative interface contracts for these subsystems. Hardware-specific driver documentation (consumer chipsets in §12a/§12b/§62/§61, server NICs in §30, etc.) specifies implementations of these contracts, not independent parallel frameworks.
Any driver implementing a subsystem interface must: - Follow the tier model (§6) using the tier specified here. - Use ISLE ring buffers (§8a) for all bulk data flows. - Implement the crash recovery callbacks defined in §7.4. - Register through the device registry (§7) rather than a subsystem-specific registration API.
12.1 Wireless Subsystem
Tier: Tier 1. Wireless I/O latency directly affects user-visible responsiveness (video calls, gaming, SSH). The ~200–500 cycle Tier 2 syscall overhead per packet is unacceptable. WiFi and cellular firmware run on-chip (not on the host CPU), so the attack surface is IOMMU-bounded — same threat model as NVMe (§8a.1).
KABI interface name: wireless_device_v1 (in interfaces/wireless_device.kabi).
// isle-core/src/net/wireless.rs — authoritative wireless driver contract
/// A wireless network device. Implemented by all wireless drivers
/// (WiFi 4/5/6/6E/7, cellular modems, 802.15.4).
pub trait WirelessDriver: Send + Sync {
// --- Identity and capabilities ---
/// Hardware address (6-byte MAC or 8-byte EUI-64 for 802.15.4).
fn mac_addr(&self) -> &[u8];
/// Supported wireless standards (bitmask).
fn capabilities(&self) -> WirelessCapabilities;
// --- Lifecycle ---
/// Bring the radio up (allocate firmware resources, enable PHY).
fn up(&self) -> Result<(), WirelessError>;
/// Take the radio down (quiesce TX/RX, release firmware resources).
fn down(&self) -> Result<(), WirelessError>;
// --- Scan and association ---
/// Request an active or passive scan on the given channels.
/// Results are delivered via the event ring (see `WirelessEvent`).
fn scan(&self, req: &ScanRequest) -> Result<(), WirelessError>;
/// Associate with a network.
fn connect(&self, params: &ConnectParams) -> Result<(), WirelessError>;
/// Disassociate from the current network.
fn disconnect(&self) -> Result<(), WirelessError>;
// --- Data path ---
/// Return the TX ring shared with isle-net. The ring is allocated
/// by the driver during `up()` from DMA-capable memory (§5.4).
fn tx_ring(&self) -> &RingBuffer<TxDescriptor>;
/// Return the RX ring shared with isle-net.
fn rx_ring(&self) -> &RingBuffer<RxDescriptor>;
// --- Power ---
/// Set the power save mode (maps to hardware PSM / DTIM skip).
fn set_power_save(&self, mode: WirelessPowerSave) -> Result<(), WirelessError>;
/// Configure Wake-on-WLAN patterns before S3 suspend.
fn set_wowlan(&self, patterns: &[WowlanPattern]) -> Result<(), WirelessError>;
// --- Statistics ---
fn stats(&self) -> WirelessStats;
}
bitflags! {
pub struct WirelessCapabilities: u32 {
const WIFI_4 = 1 << 0; // 802.11n
const WIFI_5 = 1 << 1; // 802.11ac
const WIFI_6 = 1 << 2; // 802.11ax
const WIFI_6E = 1 << 3; // 802.11ax 6 GHz
const WIFI_7 = 1 << 4; // 802.11be
const BT_5 = 1 << 8; // Bluetooth 5.x (combo chip)
const WOWLAN = 1 << 16; // Wake-on-WLAN support
const SCAN_OFFLOAD = 1 << 17; // Autonomous background scan in S0ix
}
}
#[repr(u32)]
pub enum WirelessPowerSave {
/// Radio always awake (CAM). Lowest latency, highest power.
Disabled = 0,
/// 802.11 PSM (sleep between beacons, wake on DTIM).
Enabled = 1,
/// Aggressive PSM (DTIM skipping, beacon filtering).
Aggressive = 2,
}
Event delivery: Wireless state changes (scan results, connect/disconnect,
roaming) are delivered via a per-device event ring (WirelessEvent enum) that
isle-net polls. No callbacks into driver code from the network stack.
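The polling contract above can be sketched as follows. This is an illustrative stand-in, not the authoritative types: the `WirelessEvent` variants and the toy single-producer/single-consumer ring merely model the shared-memory `RingBuffer` of §5.4.

```rust
// Illustrative sketch — not the authoritative KABI types. A toy SPSC event
// ring standing in for the shared-memory RingBuffer of §5.4.
#[derive(Debug, Clone, PartialEq)]
enum WirelessEvent {
    ScanResult { bssid: [u8; 6], rssi_dbm: i8 },
    Connected,
    Disconnected { reason: u16 },
}

struct EventRing {
    slots: Vec<Option<WirelessEvent>>,
    head: usize, // consumer cursor (isle-net)
    tail: usize, // producer cursor (driver)
}

impl EventRing {
    fn new(cap: usize) -> Self {
        Self { slots: vec![None; cap], head: 0, tail: 0 }
    }
    // Driver side: publish an event into the shared ring.
    fn push(&mut self, ev: WirelessEvent) {
        let idx = self.tail % self.slots.len();
        self.slots[idx] = Some(ev);
        self.tail += 1;
    }
    // isle-net side: drain everything available. No callback into driver
    // code — the consumer only reads shared memory.
    fn drain(&mut self) -> Vec<WirelessEvent> {
        let mut out = Vec::new();
        while self.head < self.tail {
            let idx = self.head % self.slots.len();
            if let Some(ev) = self.slots[idx].take() {
                out.push(ev);
            }
            self.head += 1;
        }
        out
    }
}

fn main() {
    let mut ring = EventRing::new(8);
    ring.push(WirelessEvent::ScanResult { bssid: [0; 6], rssi_dbm: -60 });
    ring.push(WirelessEvent::Connected);
    let events = ring.drain();
    assert_eq!(events.len(), 2);
    assert_eq!(events[1], WirelessEvent::Connected);
}
```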
Hardware-specific detail: §12b (WiFi — Intel/Realtek/Qualcomm/MediaTek), §12a (Bluetooth HCI), §30.3 (server-class 802.11ax access-point mode).
12.2 Display Subsystem
Tier: Tier 1 for integrated GPU display engines (Intel Gen12+, AMD DCN, ARM Mali DP). Tier 2 only for fully-offloaded display (USB DisplayLink, network display server) where the display path already crosses a process boundary.
KABI interface name: display_device_v1 (in interfaces/display_device.kabi).
// isle-core/src/display/mod.rs — authoritative display driver contract
/// A display controller device. Implemented by GPU/display drivers.
pub trait DisplayDriver: Send + Sync {
// --- Connector enumeration ---
/// Return all physical connectors managed by this display controller.
fn connectors(&self) -> &[DisplayConnector];
/// Read EDID from a connected display. Returns None if no display or
/// no DDC/CI support (driver falls back to safe-mode resolution).
fn read_edid(&self, connector_id: u32) -> Option<Vec<u8>>;
// --- Atomic modesetting (required; non-atomic paths are not supported) ---
/// Validate an atomic commit without applying it.
/// Returns Ok(()) if the hardware can execute the commit, or an error
/// describing the constraint that is violated.
fn atomic_check(&self, commit: &AtomicCommit) -> Result<(), DisplayError>;
/// Apply an atomic commit. Must be preceded by a successful `atomic_check`.
/// Blocks until the commit takes effect at the next vsync; with
/// `CommitFlags::ASYNC` (async page flips) it returns immediately.
fn atomic_commit(&self, commit: &AtomicCommit, flags: CommitFlags)
-> Result<(), DisplayError>;
// --- Framebuffer management ---
/// Import a DMA-BUF as a scanout framebuffer. Returns a `FramebufferId`
/// used in subsequent atomic commits. The driver pins the buffer for the
/// lifetime of the framebuffer handle.
fn import_dmabuf(
&self,
fd: DmaBufHandle,
width: u32,
height: u32,
format: PixelFormat,
modifier: u64,
) -> Result<FramebufferId, DisplayError>;
/// Release a framebuffer handle (unpins the DMA-BUF).
fn destroy_framebuffer(&self, fb: FramebufferId);
// --- Display power ---
/// Set DPMS state for a connector (On / Standby / Suspend / Off).
fn set_dpms(&self, connector_id: u32, state: DpmsState)
-> Result<(), DisplayError>;
// --- Vsync events ---
/// Return the vsync event ring (one entry per completed page flip or
/// periodic vsync). Consumers: compositors, frame pacing logic.
fn vsync_ring(&self) -> &RingBuffer<VsyncEvent>;
}
/// Flags for `atomic_commit`. An empty flag set requests the default
/// behavior: apply on the next vsync (tear-free). No zero-valued flag is
/// defined, since `bitflags!` treats 0 as the empty set.
bitflags! {
pub struct CommitFlags: u32 {
/// Apply immediately without waiting for vsync (for cursor updates).
const ASYNC = 1 << 0;
/// Test-only: validate without applying.
const TEST_ONLY = 1 << 1;
/// Allow modesetting (resolution / refresh rate change).
const ALLOW_MODESET = 1 << 2;
}
}
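The check-then-commit contract can be illustrated with a toy driver. The `ToyDisplay` and `AtomicCommit` types here are simplified stand-ins (one full-screen plane, one constraint), not the authoritative KABI structs:

```rust
// Illustrative sketch of the atomic_check -> atomic_commit contract.
// Types are simplified stand-ins, not the authoritative KABI structs.
#[derive(Debug, PartialEq)]
enum DisplayError { PlaneTooLarge }

struct AtomicCommit { plane_w: u32, plane_h: u32 }

struct ToyDisplay { mode_w: u32, mode_h: u32, committed: Option<(u32, u32)> }

impl ToyDisplay {
    // Validate without applying: report the violated constraint, if any.
    fn atomic_check(&self, c: &AtomicCommit) -> Result<(), DisplayError> {
        if c.plane_w > self.mode_w || c.plane_h > self.mode_h {
            return Err(DisplayError::PlaneTooLarge);
        }
        Ok(())
    }
    // Apply: must be preceded by (and here re-runs) a passing check.
    fn atomic_commit(&mut self, c: &AtomicCommit) -> Result<(), DisplayError> {
        self.atomic_check(c)?;
        self.committed = Some((c.plane_w, c.plane_h));
        Ok(())
    }
}

fn main() {
    let mut dpy = ToyDisplay { mode_w: 1920, mode_h: 1080, committed: None };
    assert!(dpy.atomic_commit(&AtomicCommit { plane_w: 1920, plane_h: 1080 }).is_ok());
    assert_eq!(
        dpy.atomic_check(&AtomicCommit { plane_w: 4096, plane_h: 2160 }),
        Err(DisplayError::PlaneTooLarge)
    );
}
```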
DMA-BUF integration: The display subsystem consumes DMA-BUFs produced by
the GPU compute subsystem (§42) or by CPU-rendered framebuffers. The kernel
capability model (§11) gates import_dmabuf access: a process must hold
CAP_DISPLAY to present a framebuffer on a physical connector.
Hardware-specific detail: §62.3 (display: Intel i915, AMD DCN, Raspberry Pi display pipeline, USB DisplayLink).
12.3 Audio Subsystem
Tier: Tier 1. Audio I/O requires strict real-time scheduling to avoid glitches (buffer underruns). The period interrupt (64–2048 frames at 48 kHz = 1.3–42.7 ms) must fire predictably; Tier 2 syscall overhead per interrupt would consistently violate this budget at low-latency settings.
KABI interface name: audio_device_v1 (in interfaces/audio_device.kabi).
// isle-core/src/audio/mod.rs — authoritative audio driver contract
/// An audio device. Implemented by HDA controllers, USB audio class drivers,
/// HDMI audio endpoints, Bluetooth A2DP sinks (via isle-compat HCI layer),
/// and virtual audio devices.
pub trait AudioDriver: Send + Sync {
// --- PCM streams ---
/// Negotiate a PCM stream. The driver validates that the hardware
/// supports the requested format, sample rate, channel count, and
/// period/buffer sizes, and allocates a DMA ring buffer.
fn open_pcm(&self, params: &PcmParams) -> Result<PcmStream, AudioError>;
// --- Mixer (hardware volume/routing controls) ---
/// Enumerate hardware mixer controls.
fn mixer_controls(&self) -> Vec<MixerControl>;
/// Read a mixer control value.
fn mixer_get(&self, id: u32) -> Result<i32, AudioError>;
/// Write a mixer control value.
fn mixer_set(&self, id: u32, value: i32) -> Result<(), AudioError>;
// --- Jack detection ---
/// Return the jack event ring (headphone/microphone insert/remove events).
fn jack_ring(&self) -> &RingBuffer<JackEvent>;
// --- Power ---
/// Suspend audio device (silence output, power-gate ADC/DAC). Called
/// before platform S3/S0ix entry.
fn suspend(&self) -> Result<(), AudioError>;
/// Resume audio device. Called after platform resume.
fn resume(&self) -> Result<(), AudioError>;
}
/// Parameters for opening a PCM stream.
#[repr(C)]
pub struct PcmParams {
pub direction: PcmDirection, // Playback or Capture
pub format: PcmFormat, // S16Le / S24Le / S32Le / F32Le
pub rate: u32, // Hz: 44100, 48000, 96000, 192000
pub channels: u8, // 1 (mono) … 8 (7.1)
pub period_frames: u32, // Interrupt granularity (power of 2)
pub buffer_frames: u32, // Ring buffer size (multiple of period)
}
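The latency figures in the tier rationale fall directly out of `PcmParams`: period and buffer sizes are in frames, and dividing by the sample rate gives time. A minimal sketch of that arithmetic:

```rust
// Sketch: the period/buffer latency arithmetic behind the Tier 1 argument.
// frames / rate = seconds; multiply by 1000 for milliseconds.
fn period_interval_ms(period_frames: u32, rate_hz: u32) -> f64 {
    period_frames as f64 * 1000.0 / rate_hz as f64
}

// Worst-case output latency: one full ring buffer of audio.
fn buffer_latency_ms(buffer_frames: u32, rate_hz: u32) -> f64 {
    buffer_frames as f64 * 1000.0 / rate_hz as f64
}

fn main() {
    // 64-frame period at 48 kHz: the 1.3 ms low end quoted in §12.3.
    let low = period_interval_ms(64, 48_000);
    assert!((low - 1.333).abs() < 0.01);
    // 2048-frame period at 48 kHz: the 42.7 ms high end.
    let high = period_interval_ms(2048, 48_000);
    assert!((high - 42.667).abs() < 0.01);
    // A 2-period buffer bounds worst-case latency at twice the interval.
    assert!((buffer_latency_ms(128, 48_000) - 2.0 * low).abs() < 0.01);
}
```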
ALSA compatibility: isle-compat translates snd_pcm_*, snd_ctl_*, and
snd_rawmidi_* ioctls to AudioDriver calls, enabling PipeWire, PulseAudio,
and JACK to run unmodified.
Hardware-specific detail: §61.3 (audio: Intel HDA, USB Audio Class, HDMI/DP audio endpoint).
12.4 GPU Compute
Tier: Tier 1. GPU memory management (IOMMU domain assignment, VRAM eviction, TDR recovery) must execute in kernel context. Kernel-bypass command submission is the only path that meets the latency requirements of interactive rendering and GPGPU workloads. A Tier 2 boundary crossing on every submit would add unacceptable per-frame overhead.
KABI interface name: gpu_device_v1 (in interfaces/gpu_device.kabi).
// isle-core/src/gpu/mod.rs — authoritative GPU driver contract
/// A GPU device. Implemented by drivers for discrete and integrated GPUs
/// (Intel Xe, AMD GCN/RDNA, Arm Mali Valhall, NVIDIA GSP, etc.).
pub trait GpuDevice: Send + Sync {
// --- Context management ---
/// Allocate a GPU context for the calling process. A context owns a
/// private GPU virtual address space backed by a dedicated IOMMU domain.
/// It is the unit of isolation: a fault in one context cannot corrupt
/// another. The kernel destroys the context (and all buffer objects mapped
/// into it) when the owning process exits.
///
/// Requires `CAP_GPU_RENDER`.
fn alloc_ctx(&self) -> Result<GpuContext, GpuError>;
/// Destroy a GPU context and release all GPU VA space. Any buffer objects
/// mapped into the context are unmapped but not freed; the caller must
/// drop the `BufferObject` handles separately.
fn free_ctx(&self, ctx: GpuContext) -> Result<(), GpuError>;
// --- Buffer object lifecycle ---
/// Allocate a buffer object. `size` is in bytes (page-aligned). `placement`
/// controls where physical backing is sourced (VRAM, GTT, or system
/// memory). `tiling` sets the hardware tiling modifier; use
/// `TilingModifier::Linear` unless the caller has negotiated a tiling
/// format with the display subsystem (§12.2).
///
/// Requires `CAP_GPU_RENDER`.
fn alloc_bo(
&self,
size: usize,
placement: BoPlacementFlags,
tiling: TilingModifier,
) -> Result<BufferObject, GpuError>;
/// Free a buffer object. The BO must have been unmapped from all GPU VA
/// spaces before calling this. Returns `GpuError::StillMapped` if not.
fn free_bo(&self, bo: BufferObject) -> Result<(), GpuError>;
// --- GPU virtual address space ---
/// Map a buffer object into a GPU context's virtual address space.
/// `va_hint` is advisory; the driver may choose a different VA if the
/// hint conflicts with an existing mapping. Returns the actual GPU VA.
///
/// The mapping remains valid until `unmap_bo` is called or the context
/// is destroyed.
fn map_bo(
&self,
ctx: &GpuContext,
bo: &BufferObject,
va_hint: Option<u64>,
flags: BoMapFlags,
) -> Result<u64, GpuError>;
/// Unmap a buffer object from a GPU context's virtual address space.
fn unmap_bo(&self, ctx: &GpuContext, va: u64) -> Result<(), GpuError>;
// --- Command submission ---
/// Submit a command buffer for execution on the GPU. `exec_queue`
/// selects the hardware engine (graphics, compute, copy, video).
/// `wait_fences` is a list of `GpuFence` values that must be signaled
/// before execution begins. Returns a `GpuFence` that is signaled when
/// the command buffer completes.
///
/// The command buffer pointer is a GPU VA within `ctx`. The caller is
/// responsible for ensuring the GPU VA maps to valid, initialized memory.
///
/// Requires `CAP_GPU_RENDER`.
fn submit(
&self,
ctx: &GpuContext,
exec_queue: ExecQueue,
cmdbuf_va: u64,
cmdbuf_size: usize,
wait_fences: &[GpuFence],
) -> Result<GpuFence, GpuError>;
// --- DMA-BUF export ---
/// Export a buffer object as a DMA-BUF file descriptor. The returned
/// handle can be passed to the display subsystem (`DisplayDriver::
/// import_dmabuf`, §12.2) or to a video encoder (`MediaDevice::
/// queue_buf`, §12.6). The BO reference count is incremented; the BO
/// remains live until both the `BufferObject` handle and all DMA-BUF
/// importers are dropped.
fn export_dmabuf(&self, bo: &BufferObject) -> Result<DmaBufHandle, GpuError>;
// --- TDR (Timeout Detection and Recovery) ---
/// Trigger an explicit TDR cycle on a GPU context that the caller has
/// determined is hung. The kernel also calls this internally if a context
/// has not produced a progress heartbeat for 2 seconds.
///
/// Behavior:
/// 1. The driver preempts the hung context.
/// 2. The hardware engine is reset to a known-good state.
/// 3. All other active contexts are saved, the engine is reconfigured,
/// and those contexts resume from their last checkpoint.
/// 4. The hung `GpuContext` is marked invalid; any subsequent call on it
/// returns `GpuError::ContextLost` (-ENODEV).
///
/// Requires `CAP_GPU_ADMIN`.
fn tdr_reset(&self, ctx: &GpuContext) -> Result<(), GpuError>;
// --- Capability queries ---
/// Return the set of capabilities reported by the GPU hardware (memory
/// sizes, engine counts, supported tiling modifiers, etc.).
fn capabilities(&self) -> GpuCapabilities;
}
/// A GPU context: one process's private GPU virtual address space.
/// Isolated from all other contexts by a dedicated IOMMU domain.
pub struct GpuContext {
/// Opaque kernel handle. Never dereference from outside the GPU subsystem.
pub handle: u64,
/// The IOMMU domain ID assigned to this context (for cross-subsystem
/// DMA-BUF import validation).
pub iommu_domain_id: u32,
}
/// A GPU memory allocation.
pub struct BufferObject {
/// Opaque kernel handle.
pub handle: u64,
/// Size in bytes (always page-aligned).
pub size: usize,
/// Actual placement after allocation (may differ from the requested
/// placement if VRAM was full and the driver fell back to GTT).
pub actual_placement: BoPlacementFlags,
/// Tiling modifier in use (DRM format modifier encoding).
pub tiling: TilingModifier,
}
/// Timeline semaphore. A fence is signaled when `timeline.seqno >= value`.
/// Shared across §12.4 (GPU), §12.6 (Media), §12.7 (NPU), and §12.10
/// (Crypto) to allow cross-subsystem dependency chains without conversions.
pub struct GpuFence {
/// Identifies the hardware timeline (GPU engine, DMA channel, NPU, etc.).
pub timeline_id: u64,
/// The sequence number on that timeline that must be reached.
pub seqno: u64,
}
bitflags! {
/// Where to source backing memory for a buffer object.
pub struct BoPlacementFlags: u32 {
/// GPU-local VRAM. Highest GPU bandwidth, not CPU-accessible without
/// a GTT mapping.
const VRAM = 1 << 0;
/// Graphics Translation Table (CPU-accessible via BAR2/GGTT aperture).
const GTT = 1 << 1;
/// System (DRAM) memory. Always CPU-accessible; lowest GPU bandwidth.
const SYSTEM = 1 << 2;
}
}
bitflags! {
/// Flags controlling a GPU VA mapping.
pub struct BoMapFlags: u32 {
/// GPU may read the buffer.
const READ = 1 << 0;
/// GPU may write the buffer.
const WRITE = 1 << 1;
/// CPU cache is coherent with GPU (requires hardware support; falls
/// back to uncached if not available).
const COHERENT = 1 << 2;
}
}
#[repr(u32)]
/// Hardware tiling modifier (DRM format modifier encoding, lower 32 bits).
pub enum TilingModifier {
/// No tiling — linear row-major layout.
Linear = 0,
/// Intel X-tiling (128-byte columns × 8 rows).
IntelXTile = 1,
/// Intel Y-tiling (32-byte columns × 32 rows, preferred for render).
IntelYTile = 2,
/// AMD DCC (Delta Color Compression — requires matching display engine).
AmdDcc = 3,
/// Arm Afbc (Arm Frame Buffer Compression).
ArmAfbc = 4,
}
#[repr(u32)]
/// GPU hardware engine selector for command submission.
pub enum ExecQueue {
/// 3D rendering and compute shaders (universal queue on most GPUs).
Graphics = 0,
/// Dedicated compute queue (no graphics state, runs in parallel).
Compute = 1,
/// Blitter / copy engine (lower power for buffer-to-buffer transfers).
Copy = 2,
/// Video decode engine.
VideoDec = 3,
/// Video encode engine.
VideoEnc = 4,
}
TDR model: The watchdog timer fires every 2 seconds. If a GPU context has
not advanced its hardware progress counter since the last tick, the kernel
invokes tdr_reset() on that context. The 2-second threshold is adjustable
per-device via sysfs (/sys/class/gpu/<dev>/tdr_timeout_ms) by a process
holding CAP_GPU_ADMIN. Reducing below 100 ms is not permitted; doing so
would produce false positives during legitimate shader compilation stalls.
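The per-tick watchdog decision described above reduces to a pure function: a context is hung if its hardware progress counter has not advanced since the last tick. A minimal sketch with illustrative field names:

```rust
// Sketch of the TDR watchdog decision. Field names are illustrative.
struct CtxWatch {
    last_seen: u64, // progress counter value observed at the previous tick
}

/// Returns true if `tdr_reset` should be invoked for this context.
fn watchdog_tick(watch: &mut CtxWatch, progress_now: u64) -> bool {
    let hung = progress_now == watch.last_seen;
    watch.last_seen = progress_now;
    hung
}

fn main() {
    let mut w = CtxWatch { last_seen: 0 };
    assert!(!watchdog_tick(&mut w, 10)); // counter advanced: healthy
    assert!(!watchdog_tick(&mut w, 11)); // advanced again: healthy
    assert!(watchdog_tick(&mut w, 11));  // no progress for a full tick: hung
}
```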
Cross-driver synchronization: GpuFence is the universal cross-driver
timeline primitive. The display subsystem (§12.2) accepts a GpuFence in
atomic_commit to defer scanout until rendering completes. The NPU subsystem
(§12.7) and the crypto engine (§12.10) both use the same GpuFence struct
so that inference pipelines and encrypted content pipelines can express
multi-stage dependency chains in a single data structure.
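The timeline rule ("a fence is signaled when `timeline.seqno >= value`") is a one-line comparison. A minimal check, assuming a hypothetical snapshot of current timeline positions:

```rust
// Sketch: GpuFence signaling check against a snapshot of timeline positions.
// The (timeline_id, current_seqno) snapshot format is illustrative.
#[derive(Clone, Copy)]
struct GpuFence {
    timeline_id: u64,
    seqno: u64,
}

fn is_signaled(current: &[(u64, u64)], f: &GpuFence) -> bool {
    current
        .iter()
        .any(|&(tl, seq)| tl == f.timeline_id && seq >= f.seqno)
}

fn main() {
    // Timeline 7 (say, a GPU copy engine) is currently at seqno 41.
    let timelines = [(7u64, 41u64)];
    assert!(is_signaled(&timelines, &GpuFence { timeline_id: 7, seqno: 40 }));
    assert!(!is_signaled(&timelines, &GpuFence { timeline_id: 7, seqno: 42 }));
    // A fence on an unknown timeline is not signaled.
    assert!(!is_signaled(&timelines, &GpuFence { timeline_id: 9, seqno: 1 }));
}
```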
Capability gating: CAP_GPU_RENDER is required for context allocation,
BO allocation, and command submission. CAP_GPU_ADMIN is additionally
required for clock control, performance counter access, and explicit TDR.
Both capabilities are checked in the kernel before any hardware register is
touched.
Hardware-specific detail: §67.6 (Intel Xe / i915), §67.7 (AMD AMDGPU), §67.8 (Arm Mali Valhall), §67.9 (NVIDIA GSP open-source driver).
12.5 RDMA
Tier: Tier 1. RDMA's defining property is that the hot path (posting work requests, ringing doorbells) never enters the kernel. This requires that the kernel map QP doorbell pages and work-request memory regions directly into userspace. Protection domain management, memory region pinning, and IOMMU programming must therefore reside in the kernel.
KABI interface name: rdma_device_v1 (in interfaces/rdma_device.kabi).
// isle-core/src/rdma/mod.rs — authoritative RDMA driver contract
/// An RDMA-capable network device (InfiniBand HCA, RoCEv2 NIC, iWARP adapter).
/// Implemented by drivers such as Mellanox/NVIDIA mlx5, Broadcom bnxt_re,
/// Intel irdma, and Marvell qedr.
pub trait RdmaDevice: Send + Sync {
// --- Protection domain ---
/// Allocate a protection domain. A PD is the unit of authorization:
/// memory regions, queue pairs, and address handles all belong to exactly
/// one PD. Objects in different PDs cannot communicate without explicit
/// cross-registration (which is not currently supported).
///
/// Requires `CAP_RDMA`.
fn alloc_pd(&self) -> Result<ProtectionDomain, RdmaError>;
/// Free a protection domain. All child objects (MRs, QPs, AHs) must
/// have been freed before calling this; returns `RdmaError::PdInUse`
/// if any remain.
fn dealloc_pd(&self, pd: ProtectionDomain) -> Result<(), RdmaError>;
// --- Memory regions ---
/// Register a memory region. The kernel pins the pages covering
/// `[addr, addr + length)` in the calling process's address space,
/// programs the IOMMU to allow DMA from the device, and returns the
/// local key (`lkey`, for local SGE references) and remote key (`rkey`,
/// for remote RDMA operations targeting this region). Both keys are
/// opaque 32-bit values; their encoding is device-specific.
///
/// `access` controls what operations remote peers may perform via
/// the rkey (read, write, atomic). Local access via lkey always
/// allows reads and writes.
///
/// Requires `CAP_RDMA`.
fn alloc_mr(
&self,
pd: &ProtectionDomain,
addr: usize,
length: usize,
access: MrAccessFlags,
) -> Result<MemoryRegion, RdmaError>;
/// Deregister a memory region. Pages are unpinned and the IOMMU mapping
/// is removed. Any in-flight RDMA operation targeting this MR will
/// complete with a remote access error on the peer side.
fn dealloc_mr(&self, mr: MemoryRegion) -> Result<(), RdmaError>;
// --- Completion queues ---
/// Create a completion queue with capacity for at least `cqe` entries.
/// The driver may round up to a hardware-convenient size. The actual
/// capacity is returned in `CompletionQueue::capacity`.
///
/// Requires `CAP_RDMA`.
fn create_cq(&self, cqe: u32) -> Result<CompletionQueue, RdmaError>;
/// Destroy a completion queue. All QPs that reference this CQ must be
/// destroyed first.
fn destroy_cq(&self, cq: CompletionQueue) -> Result<(), RdmaError>;
// --- Queue pairs ---
/// Create a queue pair (send queue + receive queue) associated with the
/// given protection domain and completion queues. `init_attr` specifies
/// QP type, initial queue depths, and scatter-gather element counts.
///
/// The QP is created in the RESET state. Call `modify_qp` to transition
/// it to INIT → RTR → RTS before posting work requests.
///
/// Requires `CAP_RDMA`.
fn create_qp(
&self,
pd: &ProtectionDomain,
send_cq: &CompletionQueue,
recv_cq: &CompletionQueue,
init_attr: &QpInitAttr,
) -> Result<QueuePair, RdmaError>;
/// Transition a queue pair through the state machine (RESET→INIT→RTR→RTS,
/// or error paths). `attr_mask` indicates which fields in `attr` are valid.
fn modify_qp(
&self,
qp: &mut QueuePair,
attr: &QpAttr,
attr_mask: QpAttrMask,
) -> Result<(), RdmaError>;
/// Destroy a queue pair. Any posted work requests are silently discarded.
fn destroy_qp(&self, qp: QueuePair) -> Result<(), RdmaError>;
// --- Kernel-bypass doorbell mapping ---
/// Map the QP doorbell page into the calling process's virtual address
/// space. Returns the userspace virtual address of the doorbell MMIO page.
/// The process writes work requests to the QP memory (already mapped via
/// `mmap` of the QP backing pages) and then writes a 64-bit descriptor to
/// the doorbell address to ring the hardware. No syscall is needed on the
/// hot path.
///
/// The mapping is automatically removed when the QP is destroyed or the
/// process exits.
///
/// Requires `CAP_RDMA`.
fn map_qp_doorbell(&self, qp: &QueuePair) -> Result<*mut u8, RdmaError>;
// --- Kernel-side slow path (setup and error recovery only) ---
/// Post receive work requests to the QP's receive queue. Used only
/// during initialization and after QP error recovery; the normal path
/// posts directly from userspace.
fn post_recv(
&self,
qp: &QueuePair,
wrs: &[RecvWorkRequest],
) -> Result<(), RdmaError>;
/// Post send work requests to the QP's send queue. Used only during
/// initialization and after QP error recovery.
fn post_send(
&self,
qp: &QueuePair,
wrs: &[SendWorkRequest],
) -> Result<(), RdmaError>;
// --- Port query ---
/// Query the state of a physical port. Returns link state, MTU, GID
/// table entries, port capabilities, and current speed/width.
fn query_port(&self, port_num: u8) -> Result<PortAttributes, RdmaError>;
// --- Device query ---
/// Return static device capabilities (max QPs, max CQEs, max MR size,
/// supported transport types, atomic operation support, etc.).
fn query_device(&self) -> DeviceAttributes;
}
/// A protection domain: unit of authorization for RDMA operations.
pub struct ProtectionDomain {
/// Opaque kernel handle.
pub handle: u32,
}
/// A pinned, IOMMU-mapped memory region.
pub struct MemoryRegion {
/// Opaque kernel handle.
pub handle: u32,
/// Local key: used in SGE (scatter-gather element) references.
pub lkey: u32,
/// Remote key: presented to a remote peer to authorize RDMA operations
/// targeting this region.
pub rkey: u32,
/// Base virtual address of the registered region.
pub addr: usize,
/// Length of the registered region in bytes.
pub length: usize,
}
/// A completion queue.
pub struct CompletionQueue {
/// Opaque kernel handle.
pub handle: u32,
/// Actual CQ capacity (≥ the requested `cqe`).
pub capacity: u32,
}
/// A queue pair (RC, UC, UD, or SRQ-attached RC).
pub struct QueuePair {
/// Opaque kernel handle.
pub handle: u32,
/// The QP number used by the remote peer for addressing.
pub qp_num: u32,
/// Current QP state.
pub state: QpState,
}
bitflags! {
/// Access permissions granted on a memory region to remote peers.
pub struct MrAccessFlags: u32 {
/// Remote peer may issue RDMA Read targeting this MR.
const REMOTE_READ = 1 << 0;
/// Remote peer may issue RDMA Write targeting this MR.
const REMOTE_WRITE = 1 << 1;
/// Remote peer may issue atomic operations (CAS, FAA) on this MR.
const REMOTE_ATOMIC = 1 << 2;
/// Memory window binding is allowed (for dynamic rkey invalidation).
const MW_BIND = 1 << 3;
}
}
#[repr(u32)]
/// QP state machine states (IB Architecture Specification §11.4.3).
pub enum QpState {
/// Hardware-quiesced state. No WRs are processed.
Reset = 0,
/// Initialized. Receive WRs may be posted; sends are not yet enabled.
Init = 1,
/// Ready To Receive. Path information is configured; receives are active.
Rtr = 2,
/// Ready To Send. Both sends and receives are active.
Rts = 3,
/// Send Queue Drained. Sends are voluntarily quiesced (via `modify_qp`) so
/// QP attributes can be changed; receives continue to be processed.
Sqd = 4,
/// Send Queue Error. QP encountered an error; all WRs flushed with error.
Sqe = 5,
/// Error. Both queues have been flushed with error completions.
Err = 6,
}
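Which transitions `modify_qp` accepts is ultimately device policy, but the canonical happy path can be sketched as a transition predicate: the RESET→INIT→RTR→RTS ladder, the SQD round trip, and the always-legal drop to Err.

```rust
// Sketch: the canonical QP forward transitions. This encodes only the happy
// path plus SQD; a real driver's table is device-specific and larger.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum QpState { Reset, Init, Rtr, Rts, Sqd, Sqe, Err }

fn transition_allowed(from: QpState, to: QpState) -> bool {
    use QpState::*;
    matches!(
        (from, to),
        (Reset, Init) | (Init, Rtr) | (Rtr, Rts) | (Rts, Sqd) | (Sqd, Rts)
    ) || to == Err // any state may be forced to Err
}

fn main() {
    use QpState::*;
    assert!(transition_allowed(Reset, Init));
    assert!(transition_allowed(Rtr, Rts));
    assert!(!transition_allowed(Reset, Rts)); // must pass through INIT and RTR
    assert!(transition_allowed(Rts, Err));
}
```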
Kernel-bypass model: After map_qp_doorbell() and mmap of the QP work
queue memory, userspace RDMA libraries (libibverbs, rdma-core) operate
entirely without kernel involvement on the send path. The kernel is re-entered
only for: QP state transitions, CQ overflow recovery, error handling, and
address handle (AH) creation. This model is compatible with the OpenMPI and
UCX transports used by HPC applications.
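The syscall-free send path can be sketched with plain memory standing in for the mmapped work queue and doorbell pages; the 64-bit descriptor layout here is purely illustrative:

```rust
// Sketch of the kernel-bypass send path after map_qp_doorbell(). Plain heap
// memory stands in for the mmapped queue and MMIO doorbell pages; the
// descriptor format (just the slot index here) is illustrative.
use std::ptr;

fn ring_doorbell(doorbell: *mut u64, wq: &mut [u64], slot: usize, wr: u64) {
    // 1. Publish the work request in the shared queue memory.
    wq[slot] = wr;
    // 2. One volatile store to the doorbell page tells the HCA which slot
    //    to fetch — no syscall anywhere on this path.
    unsafe { ptr::write_volatile(doorbell, slot as u64) };
}

fn main() {
    let mut wq = vec![0u64; 16];
    let mut doorbell_page = 0u64; // stand-in for the mapped MMIO page
    ring_doorbell(&mut doorbell_page, &mut wq, 3, 0xdead_beef);
    assert_eq!(wq[3], 0xdead_beef);
    assert_eq!(doorbell_page, 3);
}
```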
IOMMU integration: alloc_mr programs the device's IOMMU domain (same
model as §12.4 GpuContext) so that only the registered address range is
accessible to the device. A buffer overflow in an RDMA payload cannot reach
outside the registered MR.
Multikernel integration: The distributed lock manager (§31a) and the inter-node IPC transport (§47) both use RDMA as their high-speed fabric. The RDMA protection domain model maps directly to ISLE capability domains: each cluster node that participates in the multikernel has one PD per trust domain.
IB verbs compatibility: The RdmaDevice trait is a strict superset of the
IB verbs interface exposed by Linux's ib_verbs.h. The isle-compat layer
translates ibv_* library calls to the corresponding RdmaDevice methods,
allowing unmodified rdma-core, OpenMPI, and OpenFabrics applications to run.
Hardware-specific detail: §67.10 (Mellanox/NVIDIA mlx5), §67.11 (Intel irdma), §67.12 (Broadcom bnxt_re, Marvell qedr).
12.6 Video / Media Pipeline
Tier: Tier 1 for hardware codec engines (Intel Quick Sync, AMD VCN, Qualcomm Venus, MediaTek VENC/VDEC, Apple VideoToolbox-equivalent hardware). Tier 2 for pure software codecs: a CPU-based ffmpeg instance is already a userspace process and requires no special KABI beyond ordinary DMA-BUF file descriptor passing and shared memory.
KABI interface name: media_device_v1 (in interfaces/media_device.kabi).
// isle-core/src/media/mod.rs — authoritative media pipeline driver contract
/// A hardware media processing device (codec engine, ISP, or similar).
/// Implemented by drivers for SoC video IP blocks and discrete capture cards.
pub trait MediaDevice: Send + Sync {
// --- Capability discovery ---
/// Enumerate all codec configurations supported by the hardware. Each
/// entry specifies codec type, profile, level, maximum resolution,
/// maximum frame rate, and whether encode and/or decode is supported.
fn query_codecs(&self) -> Vec<CodecCapability>;
// --- Session lifecycle ---
/// Create a codec session. `config` specifies the codec, direction
/// (encode or decode), input/output pixel formats, and initial encoding
/// parameters (bitrate, QP, keyframe interval, rate control mode) for
/// encode sessions or output pixel format (NV12, P010, etc.) for decode.
///
/// Returns a `MediaSession` handle used for subsequent buffer operations.
fn create_session(
&self,
config: &SessionConfig,
) -> Result<MediaSession, MediaError>;
/// Destroy a codec session. All queued buffers are flushed and returned
/// with `BufferState::Error` before the session handle is invalidated.
fn destroy_session(&self, session: MediaSession) -> Result<(), MediaError>;
// --- Buffer queue ---
/// Submit an input buffer (as a DMA-BUF handle) to the session for
/// processing. For encode sessions the buffer contains raw video frames;
/// for decode sessions it contains compressed bitstream data.
///
/// `sequence` is a monotonically increasing caller-assigned sequence
/// number returned with the corresponding output buffer so the caller can
/// match inputs to outputs out-of-order.
fn queue_buf(
&self,
session: &MediaSession,
buf: DmaBufHandle,
sequence: u64,
flags: QueueFlags,
) -> Result<(), MediaError>;
/// Retrieve the next completed output buffer. Blocks until a buffer is
/// available or the session is destroyed. Returns the DMA-BUF handle of
/// the output, the input sequence number it corresponds to, and a
/// `GpuFence` (§12.4) that is signaled when the hardware has finished
/// writing to the buffer (the caller must wait on this fence before
/// reading the buffer contents from CPU or passing it to the display).
fn dequeue_buf(
&self,
session: &MediaSession,
) -> Result<DequeuedBuffer, MediaError>;
// --- Media graph topology ---
/// Return all pads (typed I/O ports) belonging to this device node.
fn pads(&self) -> &[MediaPad];
/// Create a directed link between an output pad of this device and an
/// input pad of another device. Both pads must be compatible (same
/// pixel format, resolution, and frame rate). Returns a `MediaLink`
/// handle. Enabling the link causes DMA-BUFs to flow from the source
/// pad to the sink pad without copying.
fn create_link(
&self,
src_pad: PadId,
sink_device: &dyn MediaDevice,
sink_pad: PadId,
format: LinkFormat,
) -> Result<MediaLink, MediaError>;
/// Destroy a link, stopping buffer flow between the two pads.
fn destroy_link(&self, link: MediaLink) -> Result<(), MediaError>;
// --- Dynamic parameter updates ---
/// Update encoding parameters on a running encode session without
/// destroying and recreating it. Only encode-direction parameters
/// (bitrate, QP range, keyframe force) may be updated this way.
fn update_encode_params(
&self,
session: &MediaSession,
params: &EncodeParams,
) -> Result<(), MediaError>;
}
/// A codec session handle.
pub struct MediaSession {
/// Opaque kernel handle.
pub handle: u64,
/// Session direction (Encode or Decode).
pub direction: CodecDirection,
}
/// A directed link between two media pads. The link transfers ownership of
/// each DMA-BUF from the source pad to the sink pad atomically.
pub struct MediaLink {
/// Opaque kernel handle.
pub handle: u32,
/// Source device pad identifier.
pub src_pad: PadId,
/// Sink device pad identifier.
pub sink_pad: PadId,
/// Negotiated format carried on this link.
pub format: LinkFormat,
}
/// A typed I/O port on a media device.
pub struct MediaPad {
/// Identifier unique within the owning device.
pub id: PadId,
/// Whether this pad produces (Source) or consumes (Sink) DMA-BUFs.
pub direction: PadDirection,
/// Set of pixel formats and frame sizes this pad can accept or produce.
pub supported_formats: Vec<PadFormat>,
}
/// A completed output buffer returned by `dequeue_buf`.
pub struct DequeuedBuffer {
/// DMA-BUF handle of the output data. For encode: compressed bitstream.
/// For decode: raw frame in the pixel format requested in `SessionConfig`.
pub buf: DmaBufHandle,
/// Caller-assigned sequence number from the corresponding `queue_buf`.
pub sequence: u64,
/// Fence signaled when hardware has finished writing to `buf`. The
/// caller MUST wait on this fence before reading or forwarding the buffer.
pub ready_fence: GpuFence,
}
/// Configuration for a new codec session.
#[repr(C)]
pub struct SessionConfig {
/// Codec type (H264, H265, AV1, VP9, JPEG, etc.).
pub codec: CodecType,
/// Encode or Decode.
pub direction: CodecDirection,
/// Input pixel format (for encode) or bitstream container (for decode).
pub input_format: MediaFormat,
/// Output pixel format (for decode: NV12, P010, etc.; for encode: N/A).
pub output_format: MediaFormat,
/// Initial encode parameters (ignored for decode sessions).
pub encode_params: EncodeParams,
}
/// Encoding parameters. All fields are writable after session creation via
/// `update_encode_params`.
#[repr(C)]
pub struct EncodeParams {
/// Target bitrate in bits per second. 0 means CQP (constant QP) mode.
pub bitrate_bps: u32,
/// Minimum quantization parameter (lower = better quality, larger frames).
pub qp_min: u8,
/// Maximum quantization parameter.
pub qp_max: u8,
/// Force a keyframe every N frames. 0 disables periodic keyframes.
pub keyframe_interval: u32,
/// Rate control mode (CBR, VBR, CQP, CRF).
pub rc_mode: RateControlMode,
}
#[repr(u32)]
pub enum CodecDirection {
/// Hardware encoder: raw frames in, compressed bitstream out.
Encode = 0,
/// Hardware decoder: compressed bitstream in, raw frames out.
Decode = 1,
}
#[repr(u32)]
pub enum RateControlMode {
/// Constant bitrate. Buffer fullness is maintained; quality varies.
Cbr = 0,
/// Variable bitrate. Average bitrate target; quality peaks on I-frames.
Vbr = 1,
/// Constant quantization parameter. Bitrate varies; quality is fixed.
Cqp = 2,
/// Constant rate factor (quality-based VBR, similar to x264 CRF).
Crf = 3,
}
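As an illustration of how a caller might fill `EncodeParams`, here is a sketch for a CBR stream with a 2-second GOP at 30 fps. The struct and enum are redeclared locally (mirroring the definitions above) so the example stands alone; the QP range choice is an assumption, not a mandated default:

```rust
// Sketch: constructing EncodeParams for CBR encoding. Local redeclarations
// mirror the spec structs; qp_min/qp_max defaults are illustrative.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum RateControlMode { Cbr, Vbr, Cqp, Crf }

struct EncodeParams {
    bitrate_bps: u32,
    qp_min: u8,
    qp_max: u8,
    keyframe_interval: u32,
    rc_mode: RateControlMode,
}

fn cbr_params(bitrate_bps: u32, fps: u32, gop_seconds: u32) -> EncodeParams {
    EncodeParams {
        bitrate_bps,
        qp_min: 0,  // let rate control use the full QP range
        qp_max: 51, // H.264/H.265 QP ceiling
        keyframe_interval: fps * gop_seconds, // one keyframe per GOP
        rc_mode: RateControlMode::Cbr,
    }
}

fn main() {
    let p = cbr_params(4_000_000, 30, 2);
    assert_eq!(p.keyframe_interval, 60);
    assert_eq!(p.rc_mode, RateControlMode::Cbr);
    assert_eq!(p.bitrate_bps, 4_000_000);
}
```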
bitflags! {
/// Flags for `queue_buf`.
pub struct QueueFlags: u32 {
/// Mark this buffer as the last in a stream (EOS). The session will
/// flush and return all pending output buffers after processing this
/// input.
const END_OF_STREAM = 1 << 0;
/// Force a keyframe on this input buffer (encode only).
const FORCE_KEYFRAME = 1 << 1;
}
}
Buffer graph model: A complete media pipeline is a directed acyclic graph
of MediaDevice nodes connected by MediaLink edges. DMA-BUFs flow from
source pads to sink pads without copying. A typical pipeline:
[camera sensor] → [ISP] → [encoder] → [network or file]
The ISP and encoder are separate MediaDevice instances. The link between
them carries DMA-BUFs whose lifetime is managed by the producing node; the
consuming node signals via a GpuFence when it has finished reading the
buffer so the producer can reuse it.
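Because outputs may complete out of input order, a consumer matches them back to inputs by the caller-assigned sequence number from `queue_buf`. A minimal sketch, with illustrative fields on the dequeued-buffer stand-in:

```rust
// Sketch: reordering completed outputs by input sequence number. The
// DequeuedBuffer fields here are illustrative stand-ins.
use std::collections::HashMap;

struct DequeuedBuffer {
    sequence: u64, // caller-assigned sequence from queue_buf
    buf_id: u32,   // stand-in for the output DmaBufHandle
}

fn reorder(mut completed: Vec<DequeuedBuffer>) -> Vec<u32> {
    // Index completed outputs by input sequence, then emit in input order.
    let mut by_seq: HashMap<u64, u32> =
        completed.drain(..).map(|d| (d.sequence, d.buf_id)).collect();
    let mut seqs: Vec<u64> = by_seq.keys().copied().collect();
    seqs.sort_unstable();
    seqs.iter().map(|s| by_seq.remove(s).unwrap()).collect()
}

fn main() {
    // Hardware completed input 2 before inputs 0 and 1.
    let done = vec![
        DequeuedBuffer { sequence: 2, buf_id: 102 },
        DequeuedBuffer { sequence: 0, buf_id: 100 },
        DequeuedBuffer { sequence: 1, buf_id: 101 },
    ];
    assert_eq!(reorder(done), vec![100, 101, 102]);
}
```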
V4L2 M2M compatibility: isle-compat translates V4L2 memory-to-memory
device ioctls (VIDIOC_QBUF, VIDIOC_DQBUF, VIDIOC_STREAMON) on M2M
nodes to queue_buf / dequeue_buf / session start. The pixel format
negotiation (VIDIOC_S_FMT) maps to SessionConfig field selection.
Applications using libv4l2 or GStreamer's v4l2h264enc / v4l2h264dec
elements run unmodified.
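The ioctl-to-contract mapping above can be summarized as a small translation table. The sketch below is illustrative only; the real isle-compat layer covers many more ioctls and carries per-buffer state, and the `CompatAction` / `translate` names are hypothetical:

```rust
/// Which MediaDevice/session operation an M2M ioctl translates to
/// (hypothetical names, for illustration).
#[derive(Debug, PartialEq)]
enum CompatAction {
    QueueBuf,         // maps to queue_buf
    DequeueBuf,       // maps to dequeue_buf
    SessionStart,     // maps to session start
    ConfigureSession, // maps to SessionConfig field selection
}

fn translate(ioctl: &str) -> Option<CompatAction> {
    match ioctl {
        "VIDIOC_QBUF" => Some(CompatAction::QueueBuf),
        "VIDIOC_DQBUF" => Some(CompatAction::DequeueBuf),
        "VIDIOC_STREAMON" => Some(CompatAction::SessionStart),
        "VIDIOC_S_FMT" => Some(CompatAction::ConfigureSession),
        _ => None, // unhandled ioctls fail in the compat layer
    }
}

fn main() {
    assert_eq!(translate("VIDIOC_QBUF"), Some(CompatAction::QueueBuf));
    assert!(translate("VIDIOC_G_PARM").is_none());
}
```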
Hardware-specific detail: §67.13 (Intel Quick Sync — GuC/HuC firmware), §67.14 (AMD VCN), §67.15 (Qualcomm Venus), §67.16 (MediaTek VENC/VDEC), §67.17 (camera ISP — ARM Mali C71, Qualcomm Spectra).
12.7 AI / NPU Accelerator
Tier: Tier 1. Large model weight tensors require physically contiguous DMA allocations that the kernel memory allocator must satisfy. Inference latency requirements (< 1 ms first-token for edge models) preclude the overhead of a Tier 2 boundary crossing on each inference submission.
KABI interface name: accel_device_v1 (in interfaces/accel_device.kabi).
// isle-core/src/accel/mod.rs — authoritative NPU/accelerator driver contract
/// A hardware accelerator device: NPU, DSP, or tensor processor.
/// Implemented by drivers for Qualcomm Hexagon, Intel VPU (Meteor Lake NPU),
/// Apple ANE (via open-source reimplementation), MediaTek APU, and custom
/// ASICs.
pub trait AccelDevice: Send + Sync {
// --- Buffer object management (shared model with §12.4 GPU) ---
/// Allocate a buffer object in accelerator-accessible memory. `size` is
/// in bytes (page-aligned). `placement` selects between accelerator-local
/// SRAM/DRAM, coherent system memory, or non-coherent DMA-able system
/// memory depending on what the hardware supports.
///
/// Requires `CAP_ACCEL_INFERENCE`.
fn alloc_bo(
&self,
size: usize,
placement: AccelPlacementFlags,
) -> Result<BufferObject, AccelError>;
/// Free a buffer object. Must not be in use by a model or in-flight
/// inference when freed.
fn free_bo(&self, bo: BufferObject) -> Result<(), AccelError>;
// --- Model lifecycle ---
/// Upload a pre-compiled model blob (produced by the vendor NPU compiler
/// running in userspace) to the accelerator. The blob format is
/// device-specific and opaque to the kernel; the kernel validates only
/// its size and alignment constraints. The kernel does NOT JIT-compile or
/// interpret the blob; it DMA-copies it to accelerator SRAM/DRAM and
/// registers it with the firmware.
///
/// Returns a `ModelHandle` used in subsequent `submit_inference` calls.
///
/// Requires `CAP_ACCEL_INFERENCE`.
fn load_model(
&self,
blob: DmaBufHandle,
blob_size: usize,
) -> Result<ModelHandle, AccelError>;
/// Unload a model, freeing accelerator SRAM and deregistering the model
/// from firmware. Any in-flight inference using this model must complete
/// before calling this; returns `AccelError::ModelInUse` if not.
fn unload_model(&self, model: ModelHandle) -> Result<(), AccelError>;
// --- Inference submission ---
/// Submit an inference request. `input` is a DMA-BUF containing the
/// input tensor data in the layout expected by the model (described in
/// the model blob metadata). `output` is a DMA-BUF that the accelerator
/// will write inference results to. Both buffers must be at least as
/// large as the model's declared input/output tensor sizes.
///
/// `wait_fences` lists `GpuFence` (§12.4) values that must be signaled
/// before the inference begins (e.g., a camera frame that is still being
/// written by the ISP). Returns a `GpuFence` signaled when the output
/// tensor is complete and `output` is safe to read.
///
/// Requires `CAP_ACCEL_INFERENCE`.
fn submit_inference(
&self,
model: &ModelHandle,
input: DmaBufHandle,
output: DmaBufHandle,
wait_fences: &[GpuFence],
) -> Result<GpuFence, AccelError>;
// --- Capability query ---
/// Return static device capabilities: supported data types (INT8, FP16,
/// BF16, FP32), maximum model size in bytes, maximum batch size, list of
/// supported operator sets (ONNX opset version, TFLite version, etc.),
/// and hardware performance counters layout.
fn query_capabilities(&self) -> AccelCapabilities;
// --- TDR ---
/// Reset the accelerator after a hung or timed-out inference. The kernel
/// calls this automatically when an inference does not complete within
/// the configured TDR timeout (default: 30 s for large models, adjustable
/// via `/sys/class/accel/<dev>/tdr_timeout_ms` with `CAP_ACCEL_ADMIN`).
///
/// All sessions on the device are reset. In-flight inferences return
/// `AccelError::Timeout` (-ETIMEDOUT) to their callers. If the hardware
/// supports per-session context isolation, only the hung session is
/// terminated; other sessions resume.
///
/// Requires `CAP_ACCEL_ADMIN`.
fn tdr_reset(&self) -> Result<(), AccelError>;
}
/// A loaded model handle.
pub struct ModelHandle {
/// Opaque kernel handle.
pub handle: u64,
/// Size of the model blob in bytes.
pub blob_size: usize,
/// Required input tensor size in bytes.
pub input_size: usize,
/// Required output tensor size in bytes.
pub output_size: usize,
}
/// Static capabilities of an accelerator device.
pub struct AccelCapabilities {
/// Peak INT8 throughput in tera-operations per second.
pub tops_int8: u32,
/// Peak FP16 throughput in tera-operations per second.
pub tops_fp16: u32,
/// Accelerator-local memory size in bytes (SRAM + on-package DRAM).
pub local_memory_bytes: usize,
/// Maximum single model blob size in bytes.
pub max_model_size_bytes: usize,
/// Supported numeric data types.
pub data_types: AccelDataTypeFlags,
/// Supported operator sets (bitmask: ONNX, TFLite, QNN, OpenVINO IR).
pub operator_sets: AccelOpSetFlags,
}
bitflags! {
/// Numeric data types the accelerator can execute natively.
pub struct AccelDataTypeFlags: u32 {
const INT8 = 1 << 0;
const INT16 = 1 << 1;
const FP16 = 1 << 2;
const BF16 = 1 << 3;
const FP32 = 1 << 4;
}
}
bitflags! {
/// Supported operator set languages.
pub struct AccelOpSetFlags: u32 {
/// ONNX opset (any version accepted by this device's firmware).
const ONNX = 1 << 0;
/// TensorFlow Lite flatbuffer format.
const TFLITE = 1 << 1;
/// Qualcomm QNN binary format.
const QNN = 1 << 2;
/// Intel OpenVINO IR format.
const OPENVINO = 1 << 3;
}
}
bitflags! {
/// Where to source backing memory for an accelerator buffer object.
pub struct AccelPlacementFlags: u32 {
/// Accelerator-local SRAM or on-package DRAM (highest bandwidth).
const ACCEL_LOCAL = 1 << 0;
/// System DRAM, coherent with CPU caches.
const SYSTEM_COHERENT = 1 << 1;
/// System DRAM, non-coherent (explicit cache flush/invalidate needed).
const SYSTEM_NONCOHERENT = 1 << 2;
}
}
Compiler model: The kernel never compiles or JIT-translates model graphs. Vendor SDKs (Qualcomm QNN SDK, Intel OpenVINO, Google XNNPACK, Arm Ethos toolchain) run entirely in userspace and produce a hardware-specific binary blob. The kernel's role is limited to loading that blob into accelerator memory, managing its lifetime, and scheduling inference jobs. This boundary keeps the attack surface small and avoids incorporating license-encumbered compiler code into the kernel.
Shared synchronization with GPU: AccelDevice uses GpuFence (§12.4)
for all completion signaling. A camera-to-inference pipeline can therefore
express its dependencies as:
camera_fence = ISP_submit(frame)
infer_fence = accel.submit_inference(model, input, output, &[camera_fence])
display_fence = compositor.atomic_commit(plane, &[infer_fence])
No additional synchronization primitive is needed.
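Before submission, a caller can validate its DMA-BUFs against the loaded model's declared tensor sizes, mirroring the `submit_inference` requirement that both buffers be at least as large as the model declares. The sketch below uses a local stand-in for the spec's `ModelHandle` fields; the `buffers_fit` helper is hypothetical:

```rust
/// Local stand-in for the spec's `ModelHandle` size fields (illustration only).
struct ModelHandle {
    input_size: usize,
    output_size: usize,
}

/// Pre-submission check mirroring the `submit_inference` contract: both
/// DMA-BUFs must be at least as large as the model's declared tensor sizes.
fn buffers_fit(model: &ModelHandle, input_len: usize, output_len: usize) -> bool {
    input_len >= model.input_size && output_len >= model.output_size
}

fn main() {
    // A hypothetical 224x224x3 FP32 classifier: 602,112-byte input tensor,
    // 1000-class FP32 output tensor.
    let model = ModelHandle {
        input_size: 224 * 224 * 3 * 4,
        output_size: 1000 * 4,
    };
    assert!(buffers_fit(&model, 602_112, 4_096));
    assert!(!buffers_fit(&model, 4_096, 4_096)); // undersized input rejected
}
```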
Capability gating: CAP_ACCEL_INFERENCE gates buffer allocation, model
loading, and inference submission. CAP_ACCEL_ADMIN additionally gates TDR,
thermal policy override, and access to hardware performance counters.
Hardware-specific detail: §67.18 (Qualcomm Hexagon DSP/NPU), §67.19 (Intel Meteor Lake NPU / OpenVINO kernel driver), §67.20 (MediaTek APU), §67.21 (generic ONNX Runtime kernel backend for FPGA accelerators).
12.8 DMA Engine
Tier: Tier 1. DMA engines are platform infrastructure directly used by other Tier 1 subsystems (audio DMA in §12.3, display framebuffer DMA in §12.2, storage DMA in §8a). They must operate in kernel context to program IOMMU tables and to route completion interrupts to the correct waiters.
KABI interface name: dma_engine_v1 (in interfaces/dma_engine.kabi).
// isle-core/src/dma_engine/mod.rs — authoritative DMA engine driver contract
/// A platform DMA engine controller. Implemented by drivers for Intel CBDMA /
/// DSA, ARM PL330, Synopsys eDMA, TI UDMA, and Xilinx AXI DMA.
pub trait DmaEngine: Send + Sync {
/// Request a DMA channel from this engine. `capabilities` specifies the
/// minimum set of capabilities the channel must provide (e.g.,
/// `MEM_TO_MEM | SCATTER_GATHER`). The engine selects the best-matching
/// channel from its pool; returns `DmaError::NoChannel` if none is
/// available.
///
/// On ACPI platforms, the channel is cross-referenced to the CSRT entry
/// describing it. On DT platforms, the channel is cross-referenced to the
/// `dmas` phandle in the requesting device's DT node. See the ACPI/DT
/// enumeration note below.
fn request_channel(
&self,
capabilities: DmaChannelCapabilities,
) -> Result<Box<dyn DmaChannel>, DmaError>;
/// Release a DMA channel back to the engine's pool. The channel must not
/// have any in-flight transactions (`DmaFence` values that have not yet
/// been signaled) when released; returns `DmaError::ChannelBusy` if so.
fn release_channel(&self, channel: Box<dyn DmaChannel>) -> Result<(), DmaError>;
}
/// A DMA channel: a single logical DMA stream backed by one hardware channel.
pub trait DmaChannel: Send + Sync {
/// Submit a flat memory-to-memory copy of `len` bytes from physical
/// address `src_pa` to physical address `dst_pa`. Returns a `DmaFence`
/// signaled when the copy is complete.
///
/// Both addresses must be within IOMMU-mapped regions. The caller is
/// responsible for cache coherency (flush source, invalidate destination)
/// on non-coherent platforms before and after the transfer.
fn memcpy(
&self,
dst_pa: u64,
src_pa: u64,
len: usize,
) -> Result<DmaFence, DmaError>;
/// Submit a scatter-gather copy. `entries` is a list of
/// `(src_pa, dst_pa, len)` tuples. The engine processes entries in order.
/// Returns a single `DmaFence` signaled after all entries complete.
///
/// The maximum number of entries per call is bounded by
/// `DmaChannelInfo::max_sg_entries`; split across multiple calls if
/// needed.
fn sg_copy(
&self,
entries: &[(u64, u64, usize)],
) -> Result<DmaFence, DmaError>;
/// Fill `len` bytes starting at physical address `dst_pa` with the
/// repeating byte pattern `value`. Returns a `DmaFence` signaled on
/// completion. Used for zeroing newly allocated pages and clearing
/// framebuffers.
fn fill(
&self,
dst_pa: u64,
len: usize,
value: u8,
) -> Result<DmaFence, DmaError>;
/// Return static information about this channel (capabilities,
/// maximum transfer size, maximum scatter-gather entry count).
fn channel_info(&self) -> DmaChannelInfo;
}
/// A DMA completion handle. Cheap to copy; backed by a hardware status word.
#[derive(Clone, Copy)]
pub struct DmaFence {
/// Identifies the DMA engine and channel this fence belongs to.
pub channel_id: u32,
/// Sequence number on the channel's completion timeline.
pub seqno: u64,
}
impl DmaFence {
/// Poll whether this DMA transfer has completed. Returns immediately
/// without blocking. Safe to call from interrupt context.
pub fn is_done(&self) -> bool { /* driver implementation */ unimplemented!() }
/// Block the current thread until this DMA transfer completes or until
/// `timeout_ns` nanoseconds elapse. Returns `Ok(())` on completion,
/// `Err(DmaError::Timeout)` on timeout.
pub fn wait(&self, timeout_ns: u64) -> Result<(), DmaError> { /* driver implementation */ unimplemented!() }
}
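Because `max_sg_entries` bounds each `sg_copy` call, a caller with a longer list must split it into multiple submissions. A minimal sketch of that splitting (the `chunk_sg` helper name is hypothetical):

```rust
/// Split a scatter-gather list into runs of at most `max_sg_entries`
/// entries each — one run per `sg_copy` call — preserving entry order.
fn chunk_sg(
    entries: &[(u64, u64, usize)],
    max_sg_entries: u32,
) -> Vec<Vec<(u64, u64, usize)>> {
    entries
        .chunks(max_sg_entries.max(1) as usize)
        .map(|run| run.to_vec())
        .collect()
}

fn main() {
    // Ten 4 KiB segments against a channel that accepts 4 entries per call.
    let entries: Vec<(u64, u64, usize)> = (0..10u64)
        .map(|i| (i * 4096, 0x8000_0000 + i * 4096, 4096))
        .collect();
    let runs = chunk_sg(&entries, 4);
    assert_eq!(runs.len(), 3); // 4 + 4 + 2 entries
    assert_eq!(runs[2].len(), 2);
}
```

The caller then issues one `sg_copy` per run and either waits on each returned `DmaFence` or only on the last one, since entries are processed in order.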
/// Static information about a DMA channel.
pub struct DmaChannelInfo {
/// Capabilities of this specific channel.
pub capabilities: DmaChannelCapabilities,
/// Maximum number of bytes per single `memcpy` or `fill` call.
pub max_transfer_bytes: usize,
/// Maximum number of scatter-gather entries per `sg_copy` call.
pub max_sg_entries: u32,
/// Whether this channel's transfers are observable by the CPU without
/// an explicit cache flush (i.e., the DMA path is cache-coherent).
pub coherent: bool,
}
bitflags! {
/// Capabilities that a DMA channel may provide.
pub struct DmaChannelCapabilities: u32 {
/// Memory-to-memory flat copy.
const MEM_TO_MEM = 1 << 0;
/// Memory-to-device transfers (device is the sink).
const MEM_TO_DEV = 1 << 1;
/// Device-to-memory transfers (device is the source).
const DEV_TO_MEM = 1 << 2;
/// Scatter-gather transfer support.
const SCATTER_GATHER = 1 << 3;
/// Memory fill (pattern write, used for zeroing).
const FILL = 1 << 4;
/// Cache-coherent DMA path (no manual flush/invalidate required).
const COHERENT = 1 << 5;
}
}
Shared infrastructure model: DmaChannel is the common abstraction for
all bulk-data DMA in ISLE. Subsystems that need DMA use it as follows:
- Audio (§12.3): the PcmStream DMA ring uses a MEM_TO_DEV or DEV_TO_MEM channel obtained from the audio controller's built-in DMA or from a platform DMA engine channel bound in ACPI/DT.
- Display (§12.2): cursor and framebuffer uploads on platforms without a GPU use a MEM_TO_DEV channel.
- Storage (§8a): on platforms where the storage controller does not have its own scatter-gather engine, DmaChannel::sg_copy is used for PRD tables.
On platforms where the device has its own built-in DMA (NVMe PRPs, AHCI PRDT,
PCIe DMA engines on GPUs), the device driver does not use DmaEngine at all;
the built-in DMA is programmed directly and the completion is reported via
the device's own interrupt.
ACPI/DT enumeration: On ACPI platforms, DMA engine channels are described
in the ACPI CSRT (Core System Resource Table, a Microsoft-defined ACPI
table). The kernel's ACPI layer parses the CSRT at boot and
registers each channel group as a DmaEngine instance. Consumers reference
channels by ACPI _CRS DMA descriptor. On Device Tree platforms, channels are
described using the dmas and dma-names properties in the consuming device
node, following the DMA Engine binding in the Linux kernel DT bindings (used
as the authoritative reference for this property format).
Hardware-specific detail: §67.22 (Intel CBDMA / DSA), §67.23 (ARM PL330), §67.24 (TI UDMA-P on AM65x/J7), §67.25 (Synopsys eDMA on PCIe controllers).
12.9 GPIO and Pin Control
Tier: Tier 1. GPIO controllers are low-level platform hardware directly used by many other Tier 1 drivers for chip-select lines, reset/enable signals, and interrupt routing. GPIO interrupts must be demultiplexed in the kernel IRQ subsystem (§3) before they can be delivered to drivers or (via eventfd) to userspace.
KABI interface names: gpio_controller_v1, pinctrl_v1 (in
interfaces/gpio.kabi).
// isle-core/src/gpio/mod.rs — authoritative GPIO and pin control contract
/// A GPIO controller. One instance per hardware GPIO IP block (which may
/// expose dozens to hundreds of individual lines). Implemented by drivers for
/// Intel Broxton/Cannon Lake PCH GPIO, ARM PL061, NXP RGPIO, Qualcomm TLMM,
/// and Broadcom BCM2835 GPIO.
pub trait GpioController: Send + Sync {
// --- Pin configuration ---
/// Configure a GPIO line's direction, pull resistor, and drive mode.
/// Must be called before `read` or `write` on the line.
fn configure(
&self,
line: GpioLine,
direction: GpioDirection,
pull: GpioPull,
drive: GpioDrive,
) -> Result<(), GpioError>;
// --- Digital I/O ---
/// Read the current logic level of an input (or output in read-back mode)
/// GPIO line. Returns `true` for high, `false` for low. Returns
/// `GpioError::NotInput` if the line is configured as output-only and the
/// hardware does not support output read-back.
fn read(&self, line: GpioLine) -> Result<bool, GpioError>;
/// Set the output level of an output-configured GPIO line. `high` = true
/// drives the line high; `high` = false drives it low. Returns
/// `GpioError::NotOutput` if the line is configured as input.
fn write(&self, line: GpioLine, high: bool) -> Result<(), GpioError>;
// --- Interrupt registration ---
/// Register an interrupt handler for a GPIO line. `mode` selects the
/// edge or level trigger condition. `handler` is called in a Tier 1
/// threaded interrupt context (§3 threaded IRQ model). Returns a
/// `GpioIrqHandle`; dropping the handle atomically deregisters the
/// handler and ensures no further invocations occur.
///
/// Only one handler may be registered per line at a time; returns
/// `GpioError::AlreadyRegistered` if a handler is already registered.
fn request_irq(
&self,
line: GpioLine,
mode: IrqMode,
handler: GpioIrqHandler,
) -> Result<GpioIrqHandle, GpioError>;
/// Deregister the interrupt handler associated with `handle`. Equivalent
/// to dropping the `GpioIrqHandle` but provides an explicit error return.
fn free_irq(&self, handle: GpioIrqHandle) -> Result<(), GpioError>;
// --- Controller metadata ---
/// Return the number of GPIO lines managed by this controller.
fn line_count(&self) -> u32;
/// Return the controller's unique identifier (used to construct
/// `GpioLine` handles for cross-subsystem use).
fn controller_id(&self) -> u32;
}
/// A pin control block. Manages the per-pin function multiplexer on SoCs
/// where physical pads can be assigned to multiple peripheral signals
/// (GPIO, I2C, SPI, UART, PCIe reference clock, etc.).
///
/// On platforms where pin multiplexing is co-located inside the GPIO
/// controller, both traits are implemented by the same driver struct.
pub trait PinCtrl: Send + Sync {
/// Query the list of functions available for a given pin index. Returns a
/// list of `PinFunction` values, each with a name (e.g., "gpio",
/// "i2c_sda", "spi_clk", "uart_tx") and the peripheral it routes to.
fn query_functions(&self, pin: u32) -> Result<Vec<PinFunction>, PinCtrlError>;
/// Select a function for a pin, connecting the physical pad to the
/// named peripheral signal. Any previously selected function is
/// deactivated. Returns `PinCtrlError::Conflict` if another driver has
/// claimed this pin in an incompatible function.
fn select_function(
&self,
pin: u32,
function: &PinFunction,
) -> Result<(), PinCtrlError>;
/// Release ownership of a pin, returning it to a default high-impedance
/// state. Safe to call even if no function is currently selected.
fn release_pin(&self, pin: u32) -> Result<(), PinCtrlError>;
}
/// Handle to a single GPIO line: the combination of a controller and a
/// zero-based pin index within that controller.
///
/// This type is referenced by §11.3 (I2C-HID interrupt line) and §12.3
/// (audio jack detection) and is formally defined here. All other subsystems
/// that reference a GPIO line MUST use this type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct GpioLine {
/// Identifier of the `GpioController` that owns this line.
pub controller_id: u32,
/// Zero-based index of the line within the controller (0 …
/// `controller.line_count() - 1`).
pub pin_index: u32,
}
/// RAII handle for a registered GPIO interrupt. Dropping this value
/// deregisters the handler. Implemented as a token that the kernel associates
/// with the registration record; no raw pointers are exposed.
pub struct GpioIrqHandle {
/// Opaque kernel handle. The kernel uses this to locate and remove the
/// registration entry on drop.
pub(crate) handle: u64,
}
impl Drop for GpioIrqHandle {
/// Deregister the GPIO interrupt handler. Guaranteed to be called even
/// if the owning driver panics, preventing stale handlers from firing
/// after the driver struct is freed.
fn drop(&mut self) { /* kernel deregistration via syscall or direct call */ }
}
/// Type alias for a GPIO interrupt handler function pointer.
/// The handler is called in a threaded interrupt context (§3). It must not
/// block indefinitely; it may acquire short-duration spinlocks and queue
/// work to a kernel work queue.
pub type GpioIrqHandler = fn(line: GpioLine, mode: IrqMode);
/// Available trigger modes for GPIO interrupts.
#[repr(u32)]
pub enum IrqMode {
/// Trigger on a low-to-high transition.
RisingEdge = 0,
/// Trigger on a high-to-low transition.
FallingEdge = 1,
/// Trigger on both transitions.
BothEdges = 2,
/// Trigger while the line is held high (level-triggered).
HighLevel = 3,
/// Trigger while the line is held low (level-triggered).
LowLevel = 4,
}
#[repr(u32)]
/// GPIO line direction.
pub enum GpioDirection {
/// Line is an input; the driver reads the external logic level.
Input = 0,
/// Line is an output; the driver drives the logic level.
Output = 1,
}
#[repr(u32)]
/// Internal pull resistor configuration.
pub enum GpioPull {
/// No pull resistor (high impedance when not driven).
None = 0,
/// Weak pull-up to VCC.
PullUp = 1,
/// Weak pull-down to GND.
PullDown = 2,
}
#[repr(u32)]
/// Output drive mode.
pub enum GpioDrive {
/// Totem-pole (push-pull): the driver actively drives both high and low.
PushPull = 0,
/// Open-drain: the driver only pulls low; high is achieved by an external
/// pull-up. Required for I2C bus lines and wired-AND configurations.
OpenDrain = 1,
}
/// A multiplexable function available on a SoC pin.
pub struct PinFunction {
/// Human-readable function name (e.g., "gpio", "i2c0_sda", "uart2_tx").
pub name: &'static str,
/// The peripheral subsystem this function connects to (e.g., I2C
/// controller index 0, UART controller index 2).
pub peripheral_id: u32,
}
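The `Conflict` rule in `select_function` implies per-pin claim bookkeeping inside the pinctrl driver. A minimal sketch of that bookkeeping (the `PinClaims` type and its methods are hypothetical, not the kernel's actual implementation):

```rust
use std::collections::HashMap;

/// Hypothetical claim table: maps a pin index to the function name that
/// currently owns it. Selecting succeeds when the pin is free or already
/// set to the same function; otherwise it conflicts.
struct PinClaims {
    claims: HashMap<u32, String>,
}

impl PinClaims {
    fn new() -> Self {
        Self { claims: HashMap::new() }
    }

    fn select_function(&mut self, pin: u32, function: &str) -> Result<(), String> {
        match self.claims.get(&pin) {
            Some(owner) if owner.as_str() != function => {
                Err(format!("pin {pin}: already claimed for {owner}"))
            }
            _ => {
                self.claims.insert(pin, function.to_string());
                Ok(())
            }
        }
    }

    fn release_pin(&mut self, pin: u32) {
        // Pin returns to its default high-impedance state.
        self.claims.remove(&pin);
    }
}

fn main() {
    let mut pins = PinClaims::new();
    assert!(pins.select_function(5, "i2c0_sda").is_ok());
    assert!(pins.select_function(5, "uart2_tx").is_err()); // conflict
    pins.release_pin(5);
    assert!(pins.select_function(5, "uart2_tx").is_ok()); // free again
}
```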
IRQ model: request_irq() registers the handler with the kernel IRQ
subsystem (§3). The GPIO controller's top-level interrupt line is demuxed by
the GPIO driver: on each top-level interrupt, the driver reads the controller's
pending interrupt register, identifies which lines are active, and dispatches
the registered handlers for those lines in threaded IRQ context. Handlers run
at a normal kernel thread priority with preemption enabled unless the handler
explicitly raises its priority via the §3 scheduling API.
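The demux step — reading the pending-interrupt register and identifying active lines — reduces to a bit scan. A self-contained sketch (the `pending_lines` function name is illustrative; a real driver would also acknowledge the bits it handled):

```rust
/// Return the zero-based line indices whose bits are set in the
/// controller's pending-interrupt register. The driver then dispatches
/// the registered handler for each returned line in threaded IRQ context.
fn pending_lines(pending_reg: u32) -> Vec<u32> {
    (0..32u32)
        .filter(|bit| pending_reg & (1u32 << bit) != 0)
        .collect()
}

fn main() {
    // Lines 1 and 3 pending on this top-level interrupt.
    assert_eq!(pending_lines(0b1010), vec![1, 3]);
    // No bits set: nothing to dispatch.
    assert!(pending_lines(0).is_empty());
}
```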
ACPI/DT enumeration: On ACPI platforms, GPIO lines are described using
GpioInt (interrupt) and GpioIo (I/O) resource descriptors in device
_CRS methods, following the ACPI specification §19.6.56–19.6.57. The GPIO
subsystem resolves these descriptors to GpioLine handles and registers IRQs
automatically during device enumeration. On Device Tree platforms, the
gpios phandle-with-args property and the standard GPIO binding (two-cell
format: <&gpio_controller pin_index flags>) are parsed to produce GpioLine
handles.
Cross-reference to §11.3: The GpioLine type and request_irq() method used by the
I2C-HID driver (§11.3) to register the ATTN interrupt are formally defined by
this §12.9 contract. The §11.3 description is authoritative on how I2C-HID
uses GPIO; this section is authoritative on what GPIO provides.
Hardware-specific detail: §67.26 (Intel PCH GPIO — Broxton, Cannon Lake, Tiger Lake pinctrl), §67.27 (ARM PL061), §67.28 (Qualcomm TLMM), §67.29 (NXP i.MX IOMUXC + GPIO), §67.30 (Broadcom BCM2835/2711 GPIO).
12.10 Crypto Accelerator
Tier: Tier 1. Hardware crypto engines need DMA access to key material and plaintext/ciphertext buffers. Tier 2 boundary crossing would add one to two microseconds per operation — unacceptable for TLS session establishment (RSA or ECDH operations on the critical path of every connection) and bulk record encryption (AES-GCM on every TCP segment with TLS offload).
KABI interface name: crypto_engine_v1 (in interfaces/crypto_engine.kabi).
// isle-core/src/crypto_engine/mod.rs — authoritative crypto accelerator contract
/// A hardware cryptographic accelerator. Implemented by drivers for:
/// - On-SoC crypto engines (Intel QAT, ARM TrustZone CryptoCell, NXP CAAM)
/// - NIC-integrated TLS offload engines (Mellanox ConnectX-6 TLS)
/// - HSM-adjacent secure enclaves
/// - Software fallback (when no hardware engine is present)
pub trait CryptoEngine: Send + Sync {
// --- Capability discovery ---
/// Return the list of algorithm configurations supported by this engine.
/// Each entry specifies the algorithm family, key sizes, and performance
/// tier (hardware-accelerated or software fallback). The caller selects
/// an algorithm from this list when creating a session.
fn query_algorithms(&self) -> Vec<AlgorithmDescriptor>;
// --- Key management ---
/// Import raw key material into the engine under a wrapping key (or in
/// plaintext if `wrapping_key` is None and the engine permits it). The
/// engine stores the key internally; the caller's buffer is zeroed after
/// import. Returns an opaque `KeyHandle`. Raw key bytes are never
/// accessible after this call; all subsequent operations use the handle.
///
/// `flags` controls whether the key may be exported (wrapped) later or
/// is permanently non-extractable. Non-extractable keys cannot leave the
/// hardware even if the kernel is fully compromised — the hardware
/// enforces this at the engine level.
///
/// Requires `CAP_CRYPTO_ADMIN`.
fn import_key(
&self,
algorithm: AlgorithmId,
key_bytes: &[u8],
wrapping_key: Option<&KeyHandle>,
flags: KeyFlags,
) -> Result<KeyHandle, CryptoError>;
/// Export a key that was imported with `EXPORTABLE`. The key material is
/// encrypted under `wrapping_key` and returned as an opaque blob. Returns
/// `CryptoError::NonExtractable` if the key was imported with
/// `NON_EXTRACTABLE`.
///
/// Requires `CAP_CRYPTO_ADMIN`.
fn export_key(
&self,
key: &KeyHandle,
wrapping_key: &KeyHandle,
) -> Result<Vec<u8>, CryptoError>;
/// Destroy a key handle. The engine erases all key material. After this
/// call, the `KeyHandle` is invalid and any session using it will return
/// `CryptoError::InvalidKey` on the next operation.
///
/// Requires `CAP_CRYPTO_ADMIN`.
fn destroy_key(&self, key: KeyHandle) -> Result<(), CryptoError>;
// --- Session lifecycle ---
/// Allocate a session for a specific algorithm and key. A session holds
/// per-operation state (IV/nonce counters, HMAC state, RSA blinding
/// factors) and is bound to one `KeyHandle`. Sessions are not thread-safe;
/// concurrent callers must allocate separate sessions.
///
/// Requires `CAP_CRYPTO_ACCEL`.
fn alloc_session(
&self,
algorithm: AlgorithmId,
key: &KeyHandle,
) -> Result<CryptoSession, CryptoError>;
/// Free a session. Any in-flight operation on this session must complete
/// first; returns `CryptoError::SessionBusy` if not.
fn free_session(&self, session: CryptoSession) -> Result<(), CryptoError>;
// --- Operation submission ---
/// Submit a cryptographic operation. The request is placed on the
/// engine's ring buffer (§8a ring model). Returns immediately with a
/// `GpuFence` (§12.4 timeline semaphore) that is signaled when the
/// output DMA-BUF has been fully written and is safe to read.
///
/// `request` fully describes the operation: op type (encrypt, decrypt,
/// sign, verify, hash, key-exchange), input DMA-BUF, output DMA-BUF,
/// associated data (for AEAD ciphers), nonce/IV, and tag buffer.
///
/// For AEAD operations (AES-GCM, ChaCha20-Poly1305): authentication tag
/// is appended to ciphertext on encrypt, and verified + stripped on
/// decrypt. A tag mismatch on decrypt returns `CryptoError::AuthFailed`
/// via the fence's error status.
///
/// Requires `CAP_CRYPTO_ACCEL`.
fn submit(
&self,
session: &CryptoSession,
request: &CryptoRequest,
) -> Result<GpuFence, CryptoError>;
}
/// An opaque key handle. The raw key material is inaccessible after
/// `import_key`; this handle is the only means to reference the key in
/// subsequent operations.
pub struct KeyHandle {
/// Opaque engine-assigned key identifier.
pub(crate) id: u64,
/// The algorithm this key is bound to.
pub algorithm: AlgorithmId,
/// Whether this key is allowed to be exported.
pub exportable: bool,
}
/// A crypto session: per-operation state bound to one key and algorithm.
pub struct CryptoSession {
/// Opaque kernel handle.
pub handle: u64,
/// Algorithm this session is configured for.
pub algorithm: AlgorithmId,
}
/// A single cryptographic operation request, placed on the engine's ring.
#[repr(C)]
pub struct CryptoRequest {
/// The type of operation to perform.
pub op: CryptoOp,
/// DMA-BUF containing input data (plaintext for encrypt, ciphertext for
/// decrypt, message for hash/sign, public value for key agreement).
pub input: DmaBufHandle,
/// DMA-BUF that the engine will write output into (ciphertext for
/// encrypt, plaintext for decrypt, digest for hash, signature for sign,
/// shared secret for key agreement).
pub output: DmaBufHandle,
/// Associated data for AEAD operations (authenticated but not encrypted).
/// Length zero means no associated data.
pub aad: DmaBufHandle,
/// Nonce or IV for symmetric ciphers. Length and format are
/// algorithm-specific: 12 bytes for AES-GCM, 12 bytes for
/// ChaCha20-Poly1305. Ignored for hash and asymmetric operations.
pub nonce: [u8; 16],
/// Actual nonce/IV length in bytes (0 if not applicable).
pub nonce_len: u8,
/// Input data length in bytes.
pub input_len: usize,
/// Associated data length in bytes.
pub aad_len: usize,
}
/// The specific cryptographic operation requested.
#[repr(u32)]
pub enum CryptoOp {
/// Symmetric encryption (AES-GCM, ChaCha20-Poly1305, AES-CBC, AES-CTR).
Encrypt = 0,
/// Symmetric decryption with authentication tag verification (AEAD) or
/// plain decryption (non-AEAD).
Decrypt = 1,
/// Compute a message digest (SHA-256, SHA-384, SHA-512, SHA-3-256).
Hash = 2,
/// Compute an HMAC (HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512).
Hmac = 3,
/// Asymmetric signing (RSA-PSS, ECDSA P-256, ECDSA P-384, Ed25519).
Sign = 4,
/// Asymmetric signature verification.
Verify = 5,
/// Key agreement / scalar multiplication (ECDH P-256, ECDH P-384,
/// X25519, X448). Output is the shared secret.
KeyAgreement = 6,
/// TLS record encryption (NIC TLS offload engines only). Input is a
/// plaintext TLS record; output is the encrypted wire-format record.
TlsRecordEncrypt = 7,
/// TLS record decryption (NIC TLS offload engines only).
TlsRecordDecrypt = 8,
}
/// Identifier for a specific algorithm configuration.
#[repr(u32)]
pub enum AlgorithmId {
AesGcm128 = 0,
AesGcm256 = 1,
ChaCha20Poly1305 = 2,
AesCbc128 = 3,
AesCbc256 = 4,
AesCtr128 = 5,
AesCtr256 = 6,
Sha256 = 16,
Sha384 = 17,
Sha512 = 18,
Sha3_256 = 19,
HmacSha256 = 32,
HmacSha384 = 33,
HmacSha512 = 34,
RsaPss2048Sha256 = 48,
RsaPss4096Sha384 = 49,
EcdsaP256Sha256 = 64,
EcdsaP384Sha384 = 65,
Ed25519 = 66,
EcdhP256 = 80,
EcdhP384 = 81,
X25519 = 82,
X448 = 83,
}
bitflags! {
/// Flags controlling key lifecycle and extractability.
pub struct KeyFlags: u32 {
/// Key may be exported (wrapped) by a process holding
/// `CAP_CRYPTO_ADMIN`. Mutually exclusive with `NON_EXTRACTABLE`.
const EXPORTABLE = 1 << 0;
/// Key material never leaves the hardware security boundary. Once
/// imported, it cannot be read out even with physical access to DRAM.
/// Mutually exclusive with `EXPORTABLE`.
const NON_EXTRACTABLE = 1 << 1;
/// Key is persistent across power cycles (stored in hardware key
/// store, e.g., TPM NV index or TrustZone secure storage). Engines
/// that do not support persistence return `CryptoError::Unsupported`
/// if this flag is set.
const PERSISTENT = 1 << 2;
}
}
/// Descriptor of one algorithm configuration supported by an engine.
pub struct AlgorithmDescriptor {
/// The algorithm this descriptor covers.
pub id: AlgorithmId,
/// Whether this algorithm is executed in hardware (true) or software
/// fallback (false).
pub hardware_accelerated: bool,
/// Approximate throughput in MiB/s for bulk operations (encrypt/decrypt/
/// hash). 0 for asymmetric operations where throughput is not meaningful.
pub throughput_mibps: u32,
/// Approximate latency in microseconds per single operation (for
/// asymmetric operations such as sign/verify/key-agreement).
pub latency_us: u32,
}
Software fallback: If query_algorithms() returns a descriptor with
hardware_accelerated = false for a requested algorithm, the submit() path
executes the algorithm in a kernel software implementation (Rust aes-gcm,
chacha20poly1305, sha2, p256, x25519-dalek). The API is identical
regardless of acceleration. Callers that require hardware acceleration for
security reasons (e.g., to achieve constant-time execution or non-extractable
keys) must check the hardware_accelerated flag in the descriptor before
creating a session.
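A caller that needs hardware-backed execution can encode that check as a small filter over the descriptor list. The types below are local stand-ins mirroring the spec structs, and the `require_hw` helper is hypothetical:

```rust
/// Local stand-ins for the spec's types (illustration only; trimmed fields).
#[derive(Clone, Copy, PartialEq, Eq)]
enum AlgorithmId {
    AesGcm256,
    ChaCha20Poly1305,
}

struct AlgorithmDescriptor {
    id: AlgorithmId,
    hardware_accelerated: bool,
}

/// Return the descriptor for `wanted` only when the engine executes it in
/// hardware — the check callers must make before alloc_session if they
/// rely on non-extractable keys or hardware timing guarantees.
fn require_hw(
    descs: &[AlgorithmDescriptor],
    wanted: AlgorithmId,
) -> Option<&AlgorithmDescriptor> {
    descs.iter().find(|d| d.id == wanted && d.hardware_accelerated)
}

fn main() {
    let descs = [
        AlgorithmDescriptor { id: AlgorithmId::AesGcm256, hardware_accelerated: true },
        AlgorithmDescriptor { id: AlgorithmId::ChaCha20Poly1305, hardware_accelerated: false },
    ];
    assert!(require_hw(&descs, AlgorithmId::AesGcm256).is_some());
    // Software-fallback algorithms are rejected by this caller.
    assert!(require_hw(&descs, AlgorithmId::ChaCha20Poly1305).is_none());
}
```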
TLS offload integration: A NIC with TLS record-layer offload (e.g., Mellanox
ConnectX-6, Marvell OcteonTX2) registers itself as a CryptoEngine with
TlsRecordEncrypt and TlsRecordDecrypt in its algorithm list. The kernel
TLS layer (ktls, §net-tls) queries the CryptoEngine registry and, if a
matching engine is found for the session's cipher suite, offloads record
encryption there. The TCP send path then bypasses the software TLS layer and
passes plaintext records directly to the NIC. The kernel retains ownership of
the session key via a KeyHandle; the NIC's shadow copy is invalidated when
destroy_key is called.
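A minimal sketch of the registry lookup, under the rule above that an engine qualifies only if it advertises both TLS record algorithms. The Engine record is an illustrative stand-in for the real CryptoEngine registration, and cipher-suite matching is omitted:

```rust
/// Illustrative registry entry; the real CryptoEngine carries full
/// AlgorithmDescriptor lists rather than bare names.
pub struct Engine {
    pub name: &'static str,
    pub algorithms: Vec<&'static str>,
}

/// Find the first registered engine offering both record directions,
/// as the ktls layer does before offloading a session.
pub fn find_tls_offload_engine(registry: &[Engine]) -> Option<&Engine> {
    registry.iter().find(|e| {
        e.algorithms.contains(&"TlsRecordEncrypt")
            && e.algorithms.contains(&"TlsRecordDecrypt")
    })
}
```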
Capability gating: CAP_CRYPTO_ACCEL is required for session allocation
and operation submission. CAP_CRYPTO_ADMIN is additionally required for key
import, export, and destruction, and for reading hardware performance counters.
Processes without CAP_CRYPTO_ACCEL receive the software fallback path
transparently; they do not receive an error.
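The gating rules above reduce to a capability-requirement table. The CryptoRequest enum and the bit values here are illustrative; the real capability machinery lives elsewhere in the spec:

```rust
pub const CAP_CRYPTO_ACCEL: u32 = 1 << 0;
pub const CAP_CRYPTO_ADMIN: u32 = 1 << 1;

pub enum CryptoRequest {
    SessionAlloc,
    Submit,
    KeyImport,
    KeyExport,
    KeyDestroy,
    ReadPerfCounters,
}

/// Capability bits a request must hold to reach the hardware engine.
/// Callers missing CAP_CRYPTO_ACCEL are routed to the software fallback
/// instead of receiving an error.
pub fn required_caps(req: &CryptoRequest) -> u32 {
    match req {
        CryptoRequest::SessionAlloc | CryptoRequest::Submit => CAP_CRYPTO_ACCEL,
        // Key lifecycle and perf counters additionally need ADMIN.
        _ => CAP_CRYPTO_ACCEL | CAP_CRYPTO_ADMIN,
    }
}
```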
Hardware-specific detail: §67.31 (Intel QAT — QuickAssist Technology), §67.32 (ARM TrustZone CryptoCell — cc712, cc713), §67.33 (NXP CAAM), §67.34 (Mellanox ConnectX TLS offload), §67.35 (TPM 2.0 key storage).
12a. Bluetooth HCI Driver
Interface contract: §12.1 (the WirelessDriver trait covers 802.11; BT HCI uses a separate HCI socket interface exposed via isle-compat). Tier decision: Tier 2 for the BT stack (control path not latency-sensitive), Tier 1 for the kernel HCI transport driver.
Stack Decision: BlueZ-compatible via isle-compat HCI socket interface — ISLE provides a kernel HCI (Host Controller Interface) driver that exposes /dev/hci0 as a character device implementing the standard Linux HCI socket protocol. The BlueZ userspace daemon (bluetoothd) runs in Tier 2, implementing L2CAP, SDP, RFCOMM, A2DP, HID, and pairing logic. This approach:
- Reuses the mature BlueZ stack (~200K lines, 15+ years of protocol compatibility testing).
- Avoids the multi-year effort of a clean-room Bluetooth stack.
- Maintains compatibility with existing Bluetooth management tools (bluez-utils, bluetoothctl).
12a.1 Kernel HCI Driver (Tier 1)
The HCI driver is Tier 1 (MPK-isolated) and handles raw HCI packet transport. Common transports:
- USB HCI: Bulk endpoints (ACL data), interrupt endpoint (events), control endpoint (commands). Most common on laptops (Intel, Realtek, Qualcomm combo modules).
- UART HCI: Serial port (ttyS, ttyUSB) with H4/H5/BCSP framing. Common on ARM SoCs (RPi, embedded).
// isle-core/src/bluetooth/hci.rs
/// HCI packet type.
#[repr(u8)]
pub enum HciPacketType {
/// HCI command (host → controller).
Command = 0x01,
/// ACL data (bidirectional, L2CAP payload).
AclData = 0x02,
/// SCO data (bidirectional, voice payload).
ScoData = 0x03,
/// HCI event (controller → host).
Event = 0x04,
}
/// HCI device handle (opaque to userspace).
#[repr(C)]
pub struct HciDeviceId(u32);
/// HCI command packet (max 259 bytes: 1 byte type + 2 bytes opcode + 1 byte len + 255 bytes data).
#[repr(C)]
pub struct HciCommand {
/// Packet type (always 0x01 for commands).
pub packet_type: u8,
/// Opcode (OCF + OGF encoded as u16).
pub opcode: u16,
/// Parameter length (0-255).
pub param_len: u8,
/// Parameters (variable length).
pub params: [u8; 255],
}
/// HCI event packet (max 258 bytes: 1 byte type + 1 byte event code + 1 byte len + 255 bytes data).
#[repr(C)]
pub struct HciEvent {
/// Packet type (always 0x04 for events).
pub packet_type: u8,
/// Event code.
pub event_code: u8,
/// Parameter length (0-255).
pub param_len: u8,
/// Parameters (variable length).
pub params: [u8; 255],
}
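The opcode field above packs a 6-bit Opcode Group Field (OGF) and a 10-bit Opcode Command Field (OCF), per the standard HCI encoding. A sketch of the pack/unpack helpers (the test value 0x0C03 is HCI_Reset: OGF 0x03, OCF 0x0003):

```rust
/// Encode an HCI opcode: OGF in the top 6 bits, OCF in the low 10 bits.
pub fn hci_opcode(ogf: u16, ocf: u16) -> u16 {
    (ogf << 10) | (ocf & 0x03ff)
}

/// Split an opcode back into (OGF, OCF).
pub fn hci_opcode_split(opcode: u16) -> (u16, u16) {
    (opcode >> 10, opcode & 0x03ff)
}
```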
The HCI driver exposes a ring buffer interface (§8a.2) to the BlueZ daemon:
- Command ring: BlueZ writes HciCommand structs, driver sends them to the controller via USB bulk OUT or UART TX.
- Event ring: Driver receives events from the controller (USB interrupt IN or UART RX), writes HciEvent structs to the ring.
- ACL TX ring: BlueZ writes ACL data packets (L2CAP frames), driver sends them to the controller.
- ACL RX ring: Driver receives ACL data from the controller, writes to the ring.
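Entries on the command ring follow the wire layout of the HciCommand struct above: packet type, little-endian opcode, length, then parameters. A sketch of the serialization (opcode 0x0C03 in the example is HCI_Reset):

```rust
/// Serialize an HCI command into its on-the-wire form: packet type 0x01,
/// little-endian opcode, parameter length, parameters.
pub fn encode_hci_command(opcode: u16, params: &[u8]) -> Option<Vec<u8>> {
    // The param_len field is a u8, so parameters are capped at 255 bytes.
    if params.len() > 255 {
        return None;
    }
    let mut pkt = Vec::with_capacity(4 + params.len());
    pkt.push(0x01); // HciPacketType::Command
    pkt.extend_from_slice(&opcode.to_le_bytes());
    pkt.push(params.len() as u8);
    pkt.extend_from_slice(params);
    Some(pkt)
}
```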
12a.2 BlueZ Daemon (Tier 2)
bluetoothd runs as a Tier 2 process. It opens /dev/hci0 (which is backed by the HCI ring buffer interface via isle-compat), reads/writes HCI packets, and implements all higher-layer protocols:
- L2CAP: Logical Link Control and Adaptation Protocol (multiplexing, segmentation).
- SDP: Service Discovery Protocol (enumerate remote device capabilities).
- RFCOMM: Serial port emulation over Bluetooth (for legacy apps).
- A2DP: Advanced Audio Distribution Profile (high-quality stereo audio streaming).
- AVRCP: Audio/Video Remote Control Profile (play/pause, volume control).
- HID: Human Interface Device (keyboards, mice, game controllers).
- HSP/HFP: Headset/Hands-Free Profiles (phone call audio).
Pairing: BlueZ userspace daemon (bluetoothd) handles pairing logic and stores the pairing database. Kernel provides HCI transport only.
12a.3 A2DP Audio Routing to PipeWire
When a Bluetooth headset is paired and A2DP is active:
1. bluetoothd decodes the SBC/AAC/LDAC A2DP stream from ACL packets (received via the HCI ACL RX ring).
2. Writes decoded PCM samples to a PipeWire ring buffer (§15/§61, the same ring buffers used for wired audio).
3. PipeWire mixes/routes the audio to the audio subsystem (§61.3, 15-user-io.md).
4. For playback, the reverse path: PipeWire writes PCM to a ring, bluetoothd encodes to SBC/AAC, sends via ACL TX.
Latency: A2DP adds ~100-200ms latency (codec encoding/decoding, BT scheduling). This is unavoidable (Bluetooth spec limitation). Gaming audio and video calls use SCO (Synchronous Connection-Oriented) links for lower latency at the cost of lower quality (64kbps, 8kHz sample rate).
12a.4 HID Input Routing
When a Bluetooth keyboard/mouse is paired:
1. bluetoothd receives HID reports via L2CAP over ACL.
2. Translates them to standard InputEvent structs (§15, same format as USB HID).
3. Writes to the input subsystem ring buffer (§15).
4. isle-input (the input multiplexer) routes events to the active Wayland compositor or VT.
Wake-on-Bluetooth: Before S3 suspend, bluetoothd tells the HCI driver to enable "wake on HID activity" (any HID report from a paired device wakes the system). The driver programs the USB controller's PME mask or UART's RTS line to wake on RX. Pressing a key on the Bluetooth keyboard wakes the laptop.
12a.5 Architectural Decision
Bluetooth: BlueZ-compatible via isle-compat HCI
Decision: Kernel HCI driver (Tier 1) exposes /dev/hci0. BlueZ daemon (Tier 2) implements L2CAP, A2DP, HID, pairing. Reuses mature BlueZ stack (~200K lines, 15+ years of testing) instead of multi-year clean-room effort. ISLE maintains Linux HCI ABI compatibility.
12b. WiFi Driver
Interface contract: §12.1 (WirelessDriver trait, wireless_device_v1 KABI). This section specifies the Intel/Realtek/Qualcomm/MediaTek/Broadcom implementations of that contract. Tier and ring-buffer design decisions are authoritative in §12.1.
Tier: Tier 1 (per §12.1 — latency-sensitive; IOMMU-bounded firmware threat model).
Chipset coverage (minimum for launch):
- Intel: AX210, AX211, AX411 (WiFi 6E)
- Realtek: RTL8852AE, RTL8852BE, RTL8852CE (common in consumer laptops)
- Qualcomm: QCA6390, QCA6391, WCN6855 (Snapdragon-based laptops)
- MediaTek: MT7921, MT7922 (budget laptops)
- Broadcom: BCM4350, BCM4352 (older MacBooks, some ThinkPads)
12b.1 WiFi Driver Architecture
WiFi drivers implement the WirelessDriver trait defined in §12.1 (wireless_device_v1 KABI). Each chipset driver is Tier 1, MPK-isolated on x86-64 (§6), and communicates with isle-net via the TX/RX ring buffers specified in §12.1.
// isle-core/src/net/wireless.rs
/// WiFi device handle. Opaque to userspace, used for ioctl operations.
#[repr(C)]
pub struct WirelessDeviceId(u64);
/// WiFi scan result.
#[repr(C)]
pub struct WifiScanResult {
/// BSSID (MAC address of the AP).
pub bssid: [u8; 6],
/// SSID length (0-32 bytes).
pub ssid_len: u8,
/// SSID (variable length, up to 32 bytes; remaining bytes are zero).
pub ssid: [u8; 32],
/// RSSI (signal strength in dBm, typically -100 to 0).
pub rssi: i8,
/// Channel number (1-14 for 2.4 GHz, 36-165 for 5 GHz, 1-233 for 6 GHz).
pub channel: u16,
/// Security type (bitmask: WPA2=0x1, WPA3=0x2, Enterprise=0x4).
pub security: u32,
/// BSS load (0-255, indicates AP congestion; 255 = unknown).
pub bss_load: u8,
_pad: [u8; 3],
}
/// WiFi connection parameters.
#[repr(C)]
pub struct WifiConnectParams {
/// SSID length (1-32 bytes).
pub ssid_len: u8,
/// SSID.
pub ssid: [u8; 32],
/// BSSID (all zeros = any BSSID; specific BSSID = forced roam to that AP).
pub bssid: [u8; 6],
/// Security type (WPA2=0x1, WPA3=0x2, Enterprise=0x4).
pub security: u32,
/// PSK (pre-shared key) length in bytes (0 for open networks).
pub psk_len: u8,
/// PSK (for WPA2-PSK / WPA3-SAE personal).
pub psk: [u8; 64],
/// 802.1X parameters (for enterprise; zero if not used).
pub eap: Eap8021xParams,
}
/// 802.1X / EAP parameters for enterprise WiFi.
#[repr(C)]
pub struct Eap8021xParams {
/// EAP method (0=none, 1=PEAP, 2=TTLS, 3=TLS).
pub method: u8,
/// Identity length.
pub identity_len: u8,
/// Identity (username).
pub identity: [u8; 128],
/// Password length (0 for certificate-based).
pub password_len: u8,
/// Password (for PEAP/TTLS).
pub password: [u8; 128],
/// CA certificate handle (for TLS verification; 0 = no pinning).
pub ca_cert: u64,
_pad: [u8; 6],
}
/// WiFi power save mode.
#[repr(u32)]
pub enum WifiPowerSaveMode {
/// No power save (CAM - Constantly Awake Mode). Lowest latency, highest power.
Disabled = 0,
/// 802.11 Power Save Mode (PSM). Sleep between beacons, wake for DTIM.
Enabled = 1,
/// Aggressive power save (skip DTIMs, rely on TIM). Highest battery savings.
Aggressive = 2,
}
/// WiFi connection state.
#[repr(u32)]
pub enum WifiState {
/// Not connected, not scanning.
Idle = 0,
/// Scanning for networks.
Scanning = 1,
/// Authenticating with AP (4-way handshake in progress).
Authenticating = 2,
/// Connected, link up.
Connected = 3,
/// Disconnecting (deauth sent, waiting for confirmation).
Disconnecting = 4,
}
/// WiFi statistics.
#[repr(C)]
pub struct WifiStats {
/// Current state.
pub state: WifiState,
/// Connected SSID length (0 if not connected).
pub ssid_len: u8,
/// Connected SSID.
pub ssid: [u8; 32],
/// Connected BSSID (all zeros if not connected).
pub bssid: [u8; 6],
/// Current channel.
pub channel: u16,
/// RSSI (dBm).
pub rssi: i8,
/// Link speed (Mbps).
pub link_speed_mbps: u16,
/// TX packets.
pub tx_packets: u64,
/// RX packets.
pub rx_packets: u64,
/// TX bytes.
pub tx_bytes: u64,
/// RX bytes.
pub rx_bytes: u64,
/// TX errors (failed transmissions).
pub tx_errors: u32,
/// RX errors (FCS errors, drops).
pub rx_errors: u32,
}
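Both WifiScanResult and WifiStats carry the SSID as a fixed 32-byte field plus a length byte; recovering the string must honor that length. A small sketch:

```rust
/// Recover the SSID as a string slice from the fixed 32-byte field,
/// honoring the length byte. Returns None for invalid UTF-8 (802.11 SSIDs
/// are arbitrary octets, so this can legitimately fail).
pub fn ssid_str(ssid: &[u8; 32], ssid_len: u8) -> Option<&str> {
    let len = (ssid_len as usize).min(32); // clamp a corrupt length byte
    std::str::from_utf8(&ssid[..len]).ok()
}
```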
12b.2 Firmware Isolation Model
WiFi firmware runs on the chip (Intel AX210's embedded ARM core, Qualcomm's dedicated DSP), not in host CPU Ring 0. The Tier 1 driver manages:
- Firmware upload: Firmware blobs loaded from /System/Firmware/WiFi/<vendor>/<chip>.bin at driver probe time via the isle_driver_firmware_load() KABI call (maps the blob DMA-accessible, issues the chip-specific firmware load command).
- Control path: Commands (scan, connect, disconnect) sent via MMIO registers or command rings (chip-specific).
- Data path: TX/RX ring buffers (see §12b.3) populated by the driver, consumed/produced by the firmware DMA engine.
IOMMU enforcement: The WiFi chip's DMA is bounded to:
- TX ring buffer pages (read-only from the chip's perspective)
- RX ring buffer pages (write-only from the chip's perspective)
- Firmware upload buffer (read-only, unmapped after upload completes)
The driver cannot access arbitrary physical memory, and the firmware cannot DMA outside its assigned buffers. This matches the NVMe threat model (§7): firmware is untrusted, IOMMU is the hard boundary.
Firmware blob loading: Firmware is NOT shipped in the kernel binary (bloat, licensing). /System/Firmware/ is a separate partition or directory populated during install. The kernel provides isle_driver_firmware_load(device_id, "iwlwifi-ax210-v71.ucode") which:
1. Reads the file from the firmware partition (uses VFS, Tier 1 filesystem driver).
2. Allocates an IOMMU-fenced DMA buffer.
3. Copies the firmware blob to the buffer.
4. Returns a DmaBufferHandle to the driver.
5. Driver passes the handle to the chip's firmware loader.
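Step 1 resolves a blob name under the firmware partition. A sketch of that resolution with a defensive check on the untrusted name; this helper is illustrative, as the real lookup happens inside isle_driver_firmware_load():

```rust
/// Resolve a firmware blob name under /System/Firmware/WiFi/, rejecting
/// vendor/blob components that could escape the firmware directory.
pub fn resolve_firmware_path(vendor: &str, blob: &str) -> Option<String> {
    for part in [vendor, blob] {
        // Refuse empty components, path separators, and parent-dir escapes.
        if part.is_empty() || part.contains('/') || part.contains("..") {
            return None;
        }
    }
    Some(format!("/System/Firmware/WiFi/{vendor}/{blob}"))
}
```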
12b.3 TX/RX Ring Buffer Design
WiFi uses the same ring buffer protocol as NVMe (§8a.2). The driver allocates two rings:
- TX ring: Host writes packet descriptors (packet buffer address, length, metadata). Firmware DMA engine reads descriptors, fetches packets, transmits over the air.
- RX ring: Firmware DMA engine writes received packet descriptors (packet buffer address, length, RSSI, channel, timestamp). Host reads descriptors, processes packets.
// isle-driver-sdk/src/wireless.rs
/// WiFi TX descriptor (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
/// Physical address of packet buffer (DMA-mapped).
pub buffer_addr: u64,
/// Packet length in bytes (14-2304 for 802.11).
pub length: u16,
/// TX flags (ACK required, QoS TID, encryption).
pub flags: u16,
/// Sequence number (for retransmissions).
pub seq: u16,
/// Retry count (0 for first attempt).
pub retries: u8,
/// TX power (dBm, or 0xFF for default).
pub tx_power: i8,
/// Rate index (driver-specific rate table).
pub rate_index: u8,
_pad: [u8; 47],
}
/// WiFi RX descriptor (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiRxDescriptor {
/// Physical address of packet buffer (firmware wrote packet here).
pub buffer_addr: u64,
/// Packet length in bytes.
pub length: u16,
/// RX flags (FCS OK, decryption OK, AMPDU).
pub flags: u16,
/// RSSI (dBm).
pub rssi: i8,
/// Noise floor (dBm).
pub noise: i8,
/// Channel number.
pub channel: u16,
/// Timestamp (hardware TSF, microseconds).
pub timestamp_us: u64,
_pad: [u8; 40],
}
Zero-copy path: When isle-net (the Tier 1 network stack) needs to send a packet over WiFi:
1. isle-net allocates a packet buffer from the DMA-capable memory pool (§5.4 isle_driver_dma_alloc).
2. Writes the 802.11 frame (header + payload) to the buffer.
3. Writes a WifiTxDescriptor to the TX ring.
4. Kicks the firmware (MMIO doorbell write).
5. Firmware DMA-reads the descriptor, DMA-reads the packet, transmits.
6. Firmware writes a completion entry to the TX completion ring (separate ring, omitted here for brevity; same pattern as NVMe).
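Step 3's descriptor fill can be sketched as follows. The struct mirrors the WifiTxDescriptor layout above; the flag constant is illustrative, and 0xFF-as-default for tx_power comes from the field documentation:

```rust
/// Mirror of the WifiTxDescriptor layout above (64 bytes, cache-line aligned).
#[repr(C, align(64))]
pub struct WifiTxDescriptor {
    pub buffer_addr: u64,
    pub length: u16,
    pub flags: u16,
    pub seq: u16,
    pub retries: u8,
    pub tx_power: i8,
    pub rate_index: u8,
    _pad: [u8; 47],
}

/// Illustrative flag: request link-layer acknowledgement.
pub const TX_FLAG_ACK_REQUIRED: u16 = 1 << 0;

/// Fill a TX descriptor for a DMA-mapped frame. Rejects lengths outside the
/// 14..=2304 byte range stated in the descriptor contract.
pub fn make_tx_descriptor(dma_addr: u64, len: u16, seq: u16) -> Option<WifiTxDescriptor> {
    if !(14..=2304).contains(&len) {
        return None;
    }
    Some(WifiTxDescriptor {
        buffer_addr: dma_addr,
        length: len,
        flags: TX_FLAG_ACK_REQUIRED,
        seq,
        retries: 0,       // first attempt
        tx_power: -1,     // 0xFF as i8: use the firmware's default TX power
        rate_index: 0,
        _pad: [0; 47],
    })
}
```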
12b.4 Power Management
WiFi power management integrates with §14a power budgeting and §14a.11 suspend/resume.
Power save modes:
- WifiPowerSaveMode::Disabled: Driver keeps the radio in CAM (Constantly Awake Mode). Lowest latency, ~1.5W idle power.
- WifiPowerSaveMode::Enabled: Driver enables 802.11 PSM. Radio sleeps between beacons, wakes for DTIM. ~300mW idle power, ~10-20ms wake latency.
- WifiPowerSaveMode::Aggressive: Driver enables DTIM skipping (only wake every 3rd DTIM), beacon filtering (hardware drops beacons not containing traffic indication). ~150mW idle power, ~50-100ms wake latency.
Mode selection: Controlled by the power profile (§14a.10):
- Performance profile: Disabled
- Balanced profile: Enabled
- BatterySaver profile: Aggressive
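The mapping can be written directly. PowerProfile is reproduced here from §14a.10 so the sketch is self-contained:

```rust
/// §14a.10 power profiles (reproduced for a self-contained example).
pub enum PowerProfile {
    Performance,
    Balanced,
    BatterySaver,
}

#[derive(Debug, PartialEq)]
#[repr(u32)]
pub enum WifiPowerSaveMode {
    Disabled = 0,
    Enabled = 1,
    Aggressive = 2,
}

/// Profile-to-mode mapping from the list above.
pub fn wifi_mode_for(profile: &PowerProfile) -> WifiPowerSaveMode {
    match profile {
        PowerProfile::Performance => WifiPowerSaveMode::Disabled,
        PowerProfile::Balanced => WifiPowerSaveMode::Enabled,
        PowerProfile::BatterySaver => WifiPowerSaveMode::Aggressive,
    }
}
```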
Fast wake: When the radio is in PSM and an outbound packet arrives, the driver:
1. Immediately sends a null data frame with PM=0 (telling the AP "I'm awake now").
2. Queues the outbound packet in the TX ring.
3. The firmware buffers it until the AP acknowledges the PM=0 frame (~5-10ms).
4. Then transmits the queued packet.
12b.5 WoWLAN (Wake-on-WLAN)
Before entering S3 suspend (§14a.11), the driver registers wake patterns with the firmware:
- Magic Packet: Wake on receiving a packet with destination MAC matching the WiFi interface.
- Disconnect: Wake on AP deauth/disassoc (lost connection).
- GTK Rekey: Wake on WPA2 group key rekey (maintains encryption sync).
The firmware remains powered (in D3hot, not D3cold) during S3. When a wake pattern matches, the firmware asserts the PCIe PME (Power Management Event) signal, waking the system. The driver's resume() callback (§14a.11) re-establishes the connection.
Security consideration: WoWLAN patterns are capability-gated. Only processes with CAP_NET_ADMIN can configure wake patterns, preventing DoS (malicious process sets "wake on any packet" → battery drain).
12b.6 Scan Offload
The driver supports background scanning while suspended (S0ix Modern Standby, §14a.11):
1. Before S0ix entry, the driver programs the firmware with a scan schedule (every 30 seconds, channels 1/6/11 only, passive scan).
2. Firmware performs scans autonomously while the host CPU is in C10 (powered down).
3. If scan results differ significantly (RSSI drop >20dB, AP disappeared), firmware wakes the host via PME.
4. Driver's resume handler evaluates the roaming decision.
This enables "instant reconnect" on lid open: the firmware already scanned for APs and selected the best candidate while the laptop was asleep.
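The wake criterion in step 3 of the scan-offload flow can be sketched as a small predicate, using the thresholds stated above and treating a missing AP as "disappeared":

```rust
/// Decide whether a background scan result should wake the host:
/// the tracked AP disappeared, or its RSSI dropped by more than 20 dB.
pub fn should_wake_host(prev_rssi_dbm: i8, current_rssi_dbm: Option<i8>) -> bool {
    match current_rssi_dbm {
        None => true, // AP no longer seen in the scan
        Some(cur) => (prev_rssi_dbm as i16 - cur as i16) > 20,
    }
}
```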
12b.7 Roaming
When the driver detects poor link quality (RSSI < -75dBm, packet loss >5%), it triggers a roam:
1. Background scan for APs on the same SSID.
2. Select the best candidate (highest RSSI, lowest BSS load).
3. Send a reassociation request to the new AP.
4. If successful, TX/RX rings continue using the same buffers (no data plane disruption).
5. If failed, stay connected to the current AP and retry the roam in 5 seconds.
Seamless roaming: The driver batches the last ~10 outbound packets in a shadow buffer during reassociation. If roaming succeeds, retransmits them to the new AP. If roaming fails, discards them (they're already lost). This avoids TCP connection resets during roaming.
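The roam trigger and candidate selection can be sketched with a pared-down scan record; the thresholds are the ones stated in this section:

```rust
/// Pared-down WifiScanResult for a self-contained example.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Candidate {
    pub rssi_dbm: i8,
    pub bss_load: u8, // 0-255; 255 = unknown
}

/// Roam trigger: weak signal or heavy packet loss.
pub fn should_roam(rssi_dbm: i8, packet_loss_pct: f32) -> bool {
    rssi_dbm < -75 || packet_loss_pct > 5.0
}

/// Candidate selection: highest RSSI wins; ties broken by the less
/// congested AP (lower BSS load).
pub fn best_candidate(scan: &[Candidate]) -> Option<Candidate> {
    scan.iter()
        .copied()
        .max_by_key(|c| (c.rssi_dbm, std::cmp::Reverse(c.bss_load)))
}
```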
12b.8 Architectural Decision: WiFi Tier Classification
Decision: WiFi drivers are Tier 1 (in-kernel, isolation-domain-sandboxed).
Rationale: Tier 2 (separate process) would add ~200–500 cycles of IPC overhead per packet on the hot RX path. WiFi is latency-sensitive: video calls, SSH sessions, and cloud gaming are all affected by millisecond-scale jitter. WiFi firmware runs on-chip (AX210's embedded ARM core, Qualcomm's DSP) — not on the host CPU — so Tier 1 does not mean "trust the firmware"; IOMMU enforcement is the hard boundary, matching the NVMe threat model (§7). Tier 2 would add latency without improving isolation.
13a. Camera and Video Capture
Current state: Not covered.
Requirements:
- Webcam drivers:
- UVC (USB Video Class) — most common
- MIPI CSI-2 (for ARM SoCs, integrated cameras)
- Vendor-specific protocols (some laptops have custom cameras)
- Video capture API:
- V4L2 (Video4Linux2) compatibility OR clean ISLE API?
- Pixel formats (YUYV, NV12, MJPEG, H.264)
- Resolution enumeration
- Frame rate control
- Privacy:
- Camera privacy shutter (physical or electronic)
- Indicator LED control (show when camera is active)
- Per-app camera access control (capability-based)
Tier classification: Tier 1 with strict isolation (webcam compromise must not escalate)
13b. Printers and Scanners
Current state: Not covered.
Requirements:
- Printing:
- CUPS (Common Unix Printing System) compatibility
- IPP (Internet Printing Protocol)
- Driverless printing (IPP Everywhere, AirPrint)
- Legacy printer drivers (HPLIP, Gutenprint)
- Scanning:
- SANE (Scanner Access Now Easy) compatibility
- Network scanners (eSCL, WSD)
Priority: Low (many users rely on network printing, driverless IPP)