ISLE Architecture Design Document

Canonical reference for all development. This document defines the complete architecture of ISLE. All implementation decisions must trace back to this specification.

The architecture is split across 17 files, one per Part. Section numbers (§1–66, §78–92, Appendices A–E) are stable identifiers used for cross-references.


Master Index

Note on section numbering: Section numbers (§1–§92) are stable identifiers used across all cross-references. They do not reflect reading order. Parts II and III were reordered so that kernel-core foundations precede the driver architecture that depends on them, but section numbers were preserved to avoid breaking ~300 cross-references.

Sections          Part                                   File
§1–4              I: Vision and Foundations              README.md
§10–16            II: Core Kernel Subsystems             02-kernel-core.md
§5–9              III: Driver Architecture               03-drivers.md
§17–19            IV: Boot and Hardware                  04-boot-hardware.md
§20–21            V: Linux Compatibility                 05-linux-compat.md
§22–26            VI: Security and Integrity             06-security.md
§27–33            VII: Storage and Filesystems           07-storage.md
§34–36            VIII: Networking                       08-networking.md
§37–38            IX: Virtualization                     09-virtualization.md
§39–41            X: Observability and Diagnostics       10-observability.md
§42–46            XI: AI/ML and Accelerators             11-accelerators.md
§47–48            XII: Distributed Systems               12-distributed.md
§49–54            XIII: Advanced Capabilities            13-advanced.md
§55–58, App A–E   XIV: Strategy and Roadmap              14-roadmap.md
§59–62            XV: User I/O and Human Interfaces      15-user-io.md
§63–66            XVI: Containerization and IPC          16-containers.md
§67               XVII: (to be defined)
§78–92            XVIII: Agentic Development Timeline    18-agentic-timeline.md

Detailed Table of Contents

Part I: Vision and Foundations (this file)

  1. Overview and Philosophy
  2. Three-Tier Protection Model
  3. Isolation Mechanisms: MPK and Alternatives
  4. Performance Budget

Part II: Core Kernel Subsystems (02-kernel-core.md)

  10. Concurrency Model
  11. Security Architecture
  11a. Error Handling and Fault Containment
  12. Memory Management
  12a. Process and Task Management
  13. Memory Compression Tier
  14. Scheduler
  15. CPU Bandwidth Guarantees
  16. Real-Time Guarantees
  16a. Timekeeping and Clock Management

Part III: Driver Architecture (03-drivers.md)

  5. Driver Model and Stable ABI (KABI)
  6. Driver Isolation Tiers
  7. Device Registry and Bus Management
  8. Zero-Copy I/O Path
  8a. IPC Architecture and Message Passing
  9. Crash Recovery and State Preservation

Part IV: Boot and Hardware (04-boot-hardware.md)

  17. Boot and Installation
  18. First-Class Architectures
  19. Hardware Memory Safety

Part V: Linux Compatibility (05-linux-compat.md)

  20. Syscall Interface
  20a. Futex and Userspace Synchronization
  21. Deliberately Dropped Compatibility
  21a. ISLE Native Syscall Interface

Part VI: Security and Integrity (06-security.md)

  22. Verified Boot Chain
  23. TPM Runtime Services
  24. Runtime Integrity Measurement (IMA)
  25. Post-Quantum Cryptography
  26. Confidential Computing

Part VII: Storage and Filesystems (07-storage.md)

  27. Durability Guarantees
  27a. Virtual Filesystem Layer
  28. ZFS Integration
  29. Block I/O and Volume Management
  30. Block Storage Networking
  31. Clustered Filesystems
  31a. Distributed Lock Manager
  32. Persistent Memory
  33. Computational Storage

Part VIII: Networking (08-networking.md)

  34. TCP Stack Extensibility
  35. Network Overlay and Tunneling
  36. Network Interface Naming

Part IX: Virtualization (09-virtualization.md)

  37. Host and Guest Integration
  38. Suspend and Resume

Part X: Observability and Diagnostics (10-observability.md)

  39. Fault Management Architecture
  40. Stable Tracepoint ABI
  40a. Debugging and Process Inspection
  41. Unified Object Namespace

Part XI: AI/ML and Accelerators (11-accelerators.md)

  42. Unified Accelerator Framework
  43. Accelerator Memory and P2P DMA
  44. Accelerator Isolation and Scheduling
  45. In-Kernel Inference Engine
  46. Accelerator Networking, RDMA, and Linux GPU Compatibility

Part XII: Distributed Systems (12-distributed.md)

  47. Distributed Kernel Architecture
    • 47.2.2 Device-local kernels as cluster members (multikernel model, Paths A/B/C, near-term hardware targets)
    • 47.2.2.1 isle-peer-transport: generic host-side component (~2K lines, replaces device-specific drivers)
    • 47.2.2.2 Live firmware update without host reboot (CLUSTER_LEAVE → update → CLUSTER_JOIN)
    • 47.2.2.3 Attack surface reduction (IOMMU boundary replaces 700K-line Ring 0 driver code)
    • 47.2.5 Peer Kernel Isolation and Crash Recovery (IOMMU hard boundary, unilateral PCIe controls, recovery sequence)
    • 47.12.4 CXL Shared Memory for DSM (hardware coherence eliminates DSM page protocol; DLM still required)
    • 47.12.5 CXL Devices as ISLE Peers (Type 1/2/3 taxonomy, operating models, crash recovery distinctions)
  48. SmartNIC and DPU Integration

Part XIII: Advanced Capabilities (13-advanced.md)

  49. Power Budgeting
  50. Formal Verification Readiness
  51. Safe Kernel Extensibility
  52. Live Kernel Evolution
  53. Intent-Based Resource Management
  54. Unified Compute Model

Part XIV: Strategy and Roadmap (14-roadmap.md)

  55. Driver Ecosystem Strategy
  56. Implementation Phases
  57. Verification Strategy
  58. Technical Risks

Appendices (14-roadmap.md)

  Appendix A. Licensing Model: Open Kernel License Framework (OKLF) v1.0
  Appendix B. Project Structure
  Appendix C. What ISLE Provides That Linux Cannot
  Appendix D. Cross-Feature Integration Map
  Appendix E. Open Questions

Part XV: User I/O and Human Interfaces (15-user-io.md)

  59. The TTY and PTY Subsystem
  60. The Input Subsystem (evdev)
  61. Audio Architecture (ALSA Compatibility)
  62. Display and Graphics (DRM/KMS)

Part XVI: Containerization and IPC (16-containers.md)

  63. Namespace Architecture
  64. Control Groups (Cgroups v2)
  65. POSIX Inter-Process Communication (IPC)
  66. Credential Model and Capabilities

Part XVIII: Agentic Development Timeline (18-agentic-timeline.md)

  78. Understanding the Bottleneck
  79. Development Model: Parallel Agentic Workflow
  80. Phase-by-Phase Timeline (Agentic)
  81. Total Timeline (Sequential Phases)
  82. Total Timeline (Optimized Parallelism)
  83. What About Spec Bugs?
  84. Hardware Bottlenecks
  85. Human Involvement Required
  86. Realistic Full Timeline (Agentic + Human)
  87. Comparison: Human vs Agentic
  88. Sensitivity Analysis: Slower Inference
  89. Optimistic vs Pessimistic Scenarios
  90. What Determines Success?
  91. Recommendations
  92. Final Answer: Realistic Timeline

Part I: Vision and Foundations

ISLE's core philosophy, protection model, isolation mechanisms, and performance constraints.


1. Overview and Philosophy

What ISLE Is

ISLE is a new hybrid OS kernel designed as a drop-in replacement for the Linux kernel for all practical workloads. Unmodified Linux userspace -- glibc, musl, systemd, Docker, Kubernetes, QEMU/KVM, and the entire ecosystem -- must run without recompilation. The compatibility target is 99.99% of real-world usage: every actively-maintained interface is implemented. The only exclusions are truly obsolete interfaces that have been deprecated for 15+ years (e.g., the sysctl(2) system call, whose glibc wrapper was removed in 2.32; uselib(2), used for a.out library loading; and the DOS-era direct port access granted by iopl(2)). If a syscall or ioctl is used by any software released in the last decade, ISLE implements it. The kernel is written primarily in Rust (with C and assembly only for boot code and arch-specific primitives), features a proper hybrid architecture with strong driver isolation, a stable driver ABI, and targets performance parity with monolithic Linux.

Beyond Linux compatibility, ISLE provides capabilities that Linux cannot realistically add to its existing architecture: transparent driver crash recovery without reboot, distributed kernel primitives (shared memory, distributed locking, cluster membership), a unified heterogeneous compute framework for GPUs/TPUs/NPUs/CXL devices, structured observability with automated fault management, per-cgroup power budgeting with enforcement, post-quantum cryptographic verification, and live kernel updates without downtime. These features are designed into the architecture from day one — not retrofitted as afterthoughts.

Why ISLE Exists

Linux's monolithic architecture has fundamental limitations that are nearly impossible to fix within the existing codebase:

  • No driver isolation: A single driver bug crashes the entire system. Drivers account for approximately 50% of kernel code changes and approximately 50% of all regressions and CVEs. When a driver crashes, the entire machine reboots — taking down every VM, container, and long-running job with it.
  • Device drivers are the dominant attack surface: amdgpu alone is ~700,000 lines of Ring 0 code. mlx5 is ~150,000. Every line runs with full kernel privileges. A single memory-safety bug anywhere in that code equals full kernel compromise. Firmware updates require coordinated host driver updates and usually a reboot, entangling device and OS release cycles.
  • No stable in-kernel ABI: Every kernel update can break out-of-tree drivers, requiring constant recompilation (DKMS). NVIDIA, ZFS, and every other out-of-tree module suffer from this.
  • Coarse-grained locking: RTNL and other legacy locks are scalability bottlenecks on many-core systems. Documented regressions exist on 256+ core servers.
  • No capability-based security: The monolithic privilege model means any kernel vulnerability equals full system compromise.
  • Real-time limitations: PREEMPT_RT still trades throughput for latency and cannot eliminate all unbounded-latency paths.
  • No first-class distribution: Cluster-wide shared memory, locking, and coherence are implemented as ad-hoc userspace layers (Ceph, GFS2 client modules, MPI), not as kernel primitives. Every distributed system reinvents membership, failure detection, and data placement.
  • Heterogeneous compute as afterthought: GPUs, TPUs, NPUs, DPUs, and CXL memory expanders each have their own driver stack, memory model, and scheduling. No unified framework exists for managing heterogeneous resources.
  • Device firmware treated as dumb peripheral: A BlueField DPU runs 16 ARM cores and a full Linux OS. An NVMe controller runs an embedded RTOS. A GPU runs a complete memory manager and scheduler in firmware. These devices already are computers — yet Linux commands them as passive peripherals, with no kernel-level coordination, no capability model, and no fault isolation between host and device. When a DPU crashes or misbehaves, there is no structured recovery: the host has no isolation boundary, no ordered cleanup protocol, and no way to revoke the device's access to host memory without ad-hoc workarounds.
  • Observability bolted on: eBPF, tracepoints, /proc, /sys, and audit are separate subsystems with inconsistent interfaces, added incrementally over decades.
  • "Never break userspace" constrains evolution: Decades of API debt cannot be cleaned up without breaking backward compatibility.

What ISLE Delivers

ISLE is not just "Linux with better isolation." It is a comprehensive rethink of what a production kernel should provide, addressing nine fundamental capabilities:

  1. Driver isolation with crash recovery (Sections 2-3, 6, 9) — When a driver crashes, ISLE recovers it in milliseconds without rebooting. Applications see a brief hiccup, not a system failure. On hardware with fast isolation (MPK, POE), this costs near-zero overhead. On hardware without it, administrators choose their trade-off: slower isolation via page tables, full performance without isolation, or per-driver demotion to userspace. The kernel adapts to available hardware rather than demanding specific features.

  2. Multikernel: device peers, not peripherals (Sections 47.2.2, 47.2.5, 48) — Physically-attached devices that run their own kernel instance (BlueField DPU with 16 ARM cores, RISC-V accelerator, computational storage with Zynq SoC) participate as first-class cluster peers — not managed peripherals. Each device has its own scheduler, memory manager, and capability space. Communication is ISLE message passing over PCIe P2P domain ring buffers (Section 8a Layer 4), the same abstraction used everywhere else. The host needs no device-specific driver — a single generic isle-peer-transport module (~2,000 lines) handles every ISLE peer device regardless of what it does, replacing hundreds of thousands of lines of Ring 0 driver code per device class. Firmware updates are entirely the device's own responsibility: the device sends an orderly CLUSTER_LEAVE, updates its own firmware or kernel independently, and rejoins — the host never reboots, the host driver never changes, and device and OS release cycles are fully decoupled. When a device kernel crashes, the host does not crash: it executes an ordered recovery sequence — IOMMU lockout and PCIe bus master disable in under 2ms, followed by distributed state cleanup, then optional FLR and device reboot (Section 47.2.5). The isolation is physically stronger than Tier 1 driver isolation (IOMMU hard boundary vs. MPK software domain), even though the device is more autonomous. Devices with ARM or RISC-V cores can run the ISLE kernel with zero porting effort, as ISLE already builds for aarch64-unknown-none and riscv64gc-unknown-none-elf.

  3. Distributed kernel primitives (Sections 47-48) — Cluster-wide distributed shared memory (DSM), a distributed lock manager (DLM) with RDMA-native one-sided operations, and built-in membership and quorum protocols. A cluster of ISLE nodes can share memory pages, coordinate locks, and detect failures as kernel-level operations — not userspace libraries. This enables clustered filesystems, distributed caches, and multi-node workloads without bolt-on middleware. The same distributed protocol that connects RDMA-linked servers also connects locally-attached peer kernels (Section 47.2.2), with the transport adapted to PCIe P2P instead of RDMA network.

  4. Heterogeneous compute fabric (Sections 42-46) — A unified framework for GPUs, TPUs, NPUs, FPGAs, and CXL memory. Pluggable per-device schedulers, unified memory tiers (HBM, CXL, DDR, NVMe), and cross-device P2P transfers. New accelerator types plug into the existing framework without kernel modifications.

  5. Structured observability (Sections 39-41) — Fault Management Architecture (FMA) with per-device health telemetry, rule-based diagnosis, and automated remediation. An object namespace (islefs) exposes every kernel object with capability-based access control. Integrated audit logging tied to the capability system, not a separate subsystem.

  6. Power budgeting with enforcement (Section 49) — Per-cgroup power budgets in watts, multi-domain enforcement (CPU, GPU, DRAM, package), and intent-driven optimization. Datacenters can cap power per rack; laptops can maximize battery life per application.

  7. Post-quantum security (Sections 22-26) — Hybrid classical + ML-DSA signatures for kernel and driver verification from day one. No retrofitting needed when quantum computers threaten RSA/ECDSA. Confidential computing support for Intel TDX, AMD SEV-SNP, and ARM CCA.

  8. Live kernel evolution (Section 52) — Replace kernel subsystems at runtime with versioned state migration. Security patches apply without reboot. No more "Update and Restart."

  9. Stable driver ABI (Section 5) — Drivers are binary-compatible across kernel updates. No DKMS, no recompilation on every kernel update. Third-party drivers (GPU, WiFi, storage) work across kernel versions by contract, not by accident.

The Core Technical Challenge

A kernel that is as fast as monolithic Linux, but properly designed as a hybrid with real driver isolation, a stable ABI, and modern security.

This is considered "impossible" because traditional microkernel designs impose 10-50% overhead from IPC-based isolation. ISLE achieves near-zero overhead through four key techniques:

  1. Hardware-assisted Tier 1 isolation — Using the best available mechanism on each architecture (MPK on x86, POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE), domain switches cost approximately 23-80 cycles — not the 600+ cycle IPC of traditional microkernels. On architectures without fast isolation (RISC-V), the kernel adapts: promote trusted drivers to Tier 0, demote untrusted drivers to Tier 2, or accept the page-table fallback overhead. See Section 3.7 for the full adaptive isolation policy.
  2. io_uring-style shared memory rings at every tier boundary, eliminating data copies
  3. PCID/ASID for TLB preservation across protection domain switches, avoiding the flush penalty
  4. Batch amortization of all domain-crossing costs, spreading fixed overhead across many operations
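To put a number on the fourth technique, here is a back-of-envelope sketch (an illustrative helper, not ISLE source) using the ~23-cycle WRPKRU figure from Section 3: a boundary round trip costs two domain switches, and a ring-buffer batch spreads that fixed cost across every operation it carries.

```rust
/// Illustrative sketch: per-operation overhead in cycles when `batch`
/// operations share one round trip across a Tier 1 boundary
/// (two domain switches: enter and return).
fn amortized_cycles(switch_cycles: u32, batch: u32) -> f64 {
    assert!(batch > 0);
    (2 * switch_cycles) as f64 / batch as f64
}

fn main() {
    // Unbatched: every operation pays the full 46-cycle round trip.
    println!("batch=1:  {:.1} cycles/op", amortized_cycles(23, 1));
    // A 32-entry ring submission spreads the same fixed cost.
    println!("batch=32: {:.1} cycles/op", amortized_cycles(23, 32));
}
```

With a 32-deep submission ring, the per-operation isolation cost drops from 46 cycles to under 1.5 — small against even a cached syscall, which is how the near-zero overhead claim survives frequent domain crossings.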

Design Principles

  • Performance is not negotiable. Every abstraction must justify its overhead in cycles.
  • Adapt to available hardware. When the hardware provides fast isolation, use it. When it does not, degrade gracefully — do not refuse to run. A universal kernel must work on everything, even if it means honest trade-offs on some platforms.
  • Isolation is the architecture, not a bolt-on. Driver boundaries are structural, not optional. But the enforcement mechanism adapts to what the hardware provides.
  • Plan for distribution from day one. Shared memory, locking, and coherence protocols are core kernel subsystems, not afterthoughts. Retrofitting distribution into a single-node kernel always produces inferior results.
  • Heterogeneous compute is first-class. GPUs, accelerators, CXL memory, and disaggregated resources are not special cases — they are the normal operating environment for modern workloads.
  • Device firmware is a peer, not a servant. Modern hardware runs its own kernel: NICs run firmware (BlueField DPUs run full Linux), GPUs run scheduling and memory management firmware, storage controllers run RTOS. ISLE's distributed kernel design allows device-local kernels to participate as first-class members of the distributed system — not just passive devices commanded by the host. A cluster can include CPU nodes, GPU nodes, and SmartNIC nodes as equals.
  • Rust ownership replaces runtime checks. Compile-time guarantees replace lockdep, KASAN, and similar debug-only tools.
  • Stable ABI is a first-class contract. Drivers are binary-compatible across kernel updates by design.
  • Linux compatibility is near-complete. If glibc, systemd, or any actively-maintained software calls it, we implement it. Only interfaces deprecated for 15+ years with zero modern users are excluded.

2. Three-Tier Protection Model

ISLE organizes code into three driver tiers — the ISLE Core microkernel (Tier 0), Tier 1 kernel-adjacent drivers, and Tier 2 user-space drivers — plus standard user space. The "three tiers" refer to the three levels at which kernel/driver code executes (Core, Tier 1, Tier 2); user space is not counted as a tier because it uses the standard Linux process model unchanged.

+======================================================================+
|                         ISLE CORE  (Ring 0)                          |
|  Microkernel: Rust + C/asm for arch boot                            |
|                                                                      |
|  - Capability manager         - Physical memory allocator            |
|  - Thread/process management  - Scheduler (CFS/EEVDF + RT + DL)     |
|  - IPC primitives             - MMU / IOMMU programming              |
|  - Interrupt routing          - vDSO maintenance                     |
|  - Virtual memory manager     - Page cache                           |
|  - Timer management           - Linux syscall interface              |
+======================================================================+
         |  MPK switch (~23 cycles)     |  Shared memory (0 copies)
         v                              v
+======================================================================+
|                    TIER 1: Kernel-Adjacent Drivers                    |
|  Ring 0, MPK-isolated (Intel Memory Protection Keys)                 |
|                                                                      |
|  - NVMe, AHCI/SATA            - High-perf NICs (Intel, Mellanox)    |
|  - TCP/IP + UDP stack          - GPU compute drivers                 |
|  - Block I/O layer             - Filesystem impls (ext4, XFS, btrfs) |
|  - VirtIO drivers              - Crypto subsystem                    |
|  - KVM hypervisor              - Netfilter/nftables engine           |
+======================================================================+
         |  Address-space switch           |  IOMMU-isolated
         |  (~200-500 cycles, PCID/ASID)  |  DMA fencing
         v                                v
+======================================================================+
|                    TIER 2: User-Space Drivers                         |
|  Ring 3, separate address space, IOMMU-protected DMA                 |
|                                                                      |
|  - USB drivers                 - Audio (HDA, USB Audio)              |
|  - Input devices               - Bluetooth, WiFi control plane       |
|  - Printers, scanners          - Third-party / vendor drivers        |
|  - Display server drivers      - Non-performance-critical devices    |
+======================================================================+
         |  Standard Linux syscall interface (100% compatible)
         v
+======================================================================+
|                    USER SPACE  (Ring 3)                               |
|  Unmodified Linux binaries: glibc, musl, systemd, Docker, K8s, etc. |
+======================================================================+

Complexity management — The core-of-core (scheduler + memory + caps + IPC) should be as small as feasible. For reference: seL4's verified microkernel is ~10K SLOC (but provides far fewer services), QNX's microkernel is ~100K, and the Zircon kernel (Fuchsia) is ~200K. Any subsystem that grows beyond the minimum necessary for its function should be re-evaluated for extraction to Tier 1.

How the Tiers Interact

ISLE Core to Tier 1: The core switches the MPK protection domain via WRPKRU (a single unprivileged instruction, approximately 23 cycles). Both run in Ring 0 and share the same address space, but MPK keys prevent a Tier 1 driver from reading or writing memory belonging to the core or to other Tier 1 domains. Communication uses shared-memory ring buffers -- zero copies, zero transitions for data.

ISLE Core to Tier 2: Standard process-based isolation. Tier 2 drivers run in Ring 3 with their own address space. Communication uses mapped shared-memory rings for data (zero copy) and lightweight syscall-based notifications. IOMMU restricts DMA to driver-allocated regions.

Tier 1/2 to User Space: No direct interaction for control paths — all user-space requests go through ISLE Core's syscall layer, which dispatches to the appropriate tier. However, the data path does allow direct shared memory: ISLE Core sets up shared ring buffers (Section 8) that are mapped into both the driver and user-space address spaces. Once established, data flows through these rings without ISLE Core mediation (zero-copy). ISLE Core mediates only the ring setup, teardown, and error paths.
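To make the shared-ring data path concrete, here is a minimal single-producer/single-consumer descriptor ring. This is an illustrative sketch, not the ISLE ring ABI: the `Descriptor` layout and all names are hypothetical, and a real tier boundary would place the ring in a shared mapping indexed from two protection domains (here it is a plain in-process struct). The zero-copy property comes from passing small descriptors while payloads stay put in the shared DMA pool.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical descriptor layout: payloads never move; only these
/// small records cross the tier boundary.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct Descriptor {
    buf_offset: u64, // payload location within the shared DMA pool
    len: u32,        // payload length in bytes
    flags: u32,      // e.g. end-of-packet, error bits
}

/// Single-producer/single-consumer ring; N must be a power of two.
/// Simplification: `&mut self` works for this single-threaded demo;
/// a shared-memory version would use `&self` over an UnsafeCell slab.
struct Ring<const N: usize> {
    slots: [Descriptor; N],
    head: AtomicUsize, // advanced by the producer
    tail: AtomicUsize, // advanced by the consumer
}

impl<const N: usize> Ring<N> {
    fn new() -> Self {
        assert!(N.is_power_of_two());
        Ring {
            slots: [Descriptor::default(); N],
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer: publish a descriptor; returns false when the ring is full.
    fn push(&mut self, d: Descriptor) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head - tail == N {
            return false; // full: consumer has not caught up
        }
        self.slots[head & (N - 1)] = d;
        // Release ordering publishes the slot write before the new head.
        self.head.store(head + 1, Ordering::Release);
        true
    }

    /// Consumer: take the oldest descriptor, if any.
    fn pop(&mut self) -> Option<Descriptor> {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let d = self.slots[tail & (N - 1)];
        self.tail.store(tail + 1, Ordering::Release);
        Some(d)
    }
}

fn main() {
    let mut ring: Ring<4> = Ring::new();
    for i in 0u64..4 {
        assert!(ring.push(Descriptor { buf_offset: i * 4096, len: 1500, flags: 0 }));
    }
    assert!(!ring.push(Descriptor::default())); // ring is full
    assert_eq!(ring.pop().unwrap().buf_offset, 0);
}
```

In the arrangement this section describes, both endpoints would map the same ring pages; notifications (the syscall-based wakeups mentioned above) are needed only on empty-to-non-empty transitions, so steady-state traffic crosses no privilege boundary at all.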


3. Isolation Mechanisms: MPK and Alternatives

Intel Memory Protection Keys (MPK) provide hardware-enforced memory domain isolation at a cost comparable to a function call. This is the key technology that makes hybrid kernel performance viable.

Isolation Philosophy: Best Effort Within Performance Budget

Key principle: Driver isolation in ISLE is not a single fixed design point. It is a spectrum that varies across hardware architectures, and the approach is deliberately "best effort within the performance budget" rather than "maximum isolation everywhere."

Why this matters:

  1. Hardware capability varies widely: x86_64 has MPK (16 domains, ~23 cycles). AArch64 has POE (8 domains, ~40-80 cycles) on newer cores, or page-table fallback (~150-300 cycles) on current servers. ARMv7 has DACR (16 domains, ~10-20 cycles). RISC-V has no fast isolation mechanism at all. A design that mandates uniform isolation would either (a) impose unacceptable overhead on some architectures, or (b) fail to leverage better isolation on architectures that support it.

  2. Performance is a requirement, not a nice-to-have: The 5% overhead target is non-negotiable. ISLE must be a drop-in replacement for Linux — if I/O latency increases by 20%, users will not adopt it regardless of how strong the isolation is.

  3. The escape hatch always exists: Any Tier 1 driver can be demoted to Tier 2 (full process isolation) at any time — via per-driver manifest, sysfs knob, or automatic crash-count policy. If an administrator values isolation over performance for a specific workload or hardware configuration, that choice is always available. The tradeoff is explicit and user-controlled.

  4. This is not a bug, it's a feature: Some reviewers may see varying isolation strength across architectures as a "flaw" or "inconsistency." It is neither. It is an honest acknowledgment of hardware reality. The alternative — pretending all architectures have identical isolation capabilities, or mandating full process isolation everywhere (and accepting 20-50% overhead) — would make ISLE impractical for its intended use case as a Linux replacement.

The design contract:

Hardware               Tier 1 Isolation          Overhead   Fallback Option
x86_64 with MPK        Strong (MPK domains)      ~1-2%      Demote to Tier 2 for stronger isolation
AArch64 with POE       Strong (POE indices)      ~2-4%      Demote to Tier 2 for stronger isolation
AArch64 without POE    Moderate (page tables)    ~6-12%     Demote to Tier 2, or promote to Tier 0 for performance
ARMv7 with DACR        Strong (DACR domains)     ~0.5-1%    Demote to Tier 2 for stronger isolation
RISC-V                 Weak (page tables only)   ~8-20%     Demote to Tier 2, or promote to Tier 0 for performance
PPC32/PPC64LE          Strong-Moderate           ~1-5%      Demote to Tier 2 for stronger isolation

Summary: ISLE provides the best isolation the hardware can deliver within the performance budget, with a user-controlled escape hatch to stronger isolation (Tier 2) when security requirements exceed what the hardware can efficiently provide. This is a pragmatic engineering tradeoff, not a design flaw.

How MPK Works

Each page table entry contains a 4-bit protection key (PKEY), assigning the page to one of 16 domains (0-15). The PKRU register holds per-domain read/write permission bits. The WRPKRU instruction updates these permissions in approximately 23 cycles (measured: ~23 cycles on Skylake [libmpk, USENIX ATC '19], ~28 cycles on Skylake-SP [EPK, USENIX ATC '22]) -- no TLB flush, no privilege transition, no system call.

Cost Comparison

Mechanism                Cost per transition   Isolation strength   Used for
Function call            ~1-5 cycles           None                 Linux monolithic
Intel MPK WRPKRU         ~23 cycles            Memory domain        Tier 1 drivers
Full IPC (seL4-style)    ~600-1000 cycles      Full address space   Too expensive
Address-space switch     ~200-600 cycles       Full process         Tier 2 drivers

MPK gives meaningful isolation -- a Tier 1 driver cannot read or write kernel private data, other driver data, or memory in other MPK domains -- at only approximately 23 cycles per boundary crossing. Combined with IOMMU for DMA fencing, this is the foundation of our performance story.

MPK Domain Allocation

With 16 available domains (PKEY 0-15), the allocation strategy is:

PKEY    Assignment
0       ISLE Core (kernel private data)
1       Shared read-only (ring buffer descriptors)
2-13    Tier 1 driver domains (12 available)
14      Shared DMA buffer pool
15      Guard / unmapped

When more than 12 Tier 1 domains are needed, related drivers are grouped into the same domain (for example, all block drivers share one domain, all network drivers share another). This grouping is configurable via policy.
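The allocation above determines the PKRU value a Tier 1 driver runs under. A sketch of the bit math (the helper functions are hypothetical, not ISLE source): PKRU holds two bits per key, where bit 2k is access-disable (AD) and bit 2k+1 is write-disable (WD), so a set bit removes a permission.

```rust
const PKEY_SHARED_RO: u32 = 1;  // ring buffer descriptors (read-only)
const PKEY_DMA_POOL: u32 = 14;  // shared DMA buffer pool

/// All 16 keys denied: AD and WD set everywhere.
fn deny_all() -> u32 {
    0xFFFF_FFFF
}

/// Clear AD and WD for `key`: full read/write access.
fn grant_rw(pkru: u32, key: u32) -> u32 {
    pkru & !(0b11 << (2 * key))
}

/// Clear AD, set WD for `key`: read-only access.
fn grant_ro(pkru: u32, key: u32) -> u32 {
    (pkru & !(0b01 << (2 * key))) | (0b10 << (2 * key))
}

/// PKRU for a Tier 1 driver assigned to `own_key` (2..=13): RW on its
/// own domain and the DMA pool, RO on shared descriptors, no access to
/// ISLE Core (PKEY 0), other drivers, or the guard key (15).
fn tier1_pkru(own_key: u32) -> u32 {
    assert!((2..=13).contains(&own_key));
    let pkru = grant_rw(deny_all(), own_key);
    let pkru = grant_rw(pkru, PKEY_DMA_POOL);
    grant_ro(pkru, PKEY_SHARED_RO)
}

fn main() {
    let pkru = tier1_pkru(2);
    assert_eq!(pkru & 0b11, 0b11);       // key 0 (ISLE Core) fully denied
    assert_eq!((pkru >> 4) & 0b11, 0b00); // key 2 (own domain) fully open
    println!("PKRU for driver in domain 2: {pkru:#010x}");
}
```

A value like this would be computed once at driver load time from the allocation table; the runtime trampoline only installs precomputed bitmaps, which is what makes load-time validation of reachable permission sets possible.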

WRPKRU Threat Model: Crash Containment, Not Exploitation Prevention

Critical design constraint: WRPKRU is an unprivileged instruction. Any code running in Ring 0 — including Tier 1 driver code — can execute WRPKRU to modify its own MPK permission register, granting access to any MPK domain including ISLE Core (PKEY 0). This means MPK isolation provides crash containment (preventing buggy drivers from corrupting kernel memory) but does not provide exploitation prevention (compromised Ring 0 code can execute WRPKRU to escape).

Security model — ISLE's Tier 1 isolation is designed to survive driver bugs, not driver exploitation. The rationale: the vast majority of kernel crashes are caused by bugs (null dereference, use-after-free, buffer overrun), not by attackers with arbitrary code execution inside a specific driver. For environments requiring defense against compromised Ring 0 code, Tier 2 (full process isolation) provides the strong boundary — at higher latency cost.

What MPK actually protects against:

  • Accidental memory corruption: Null pointer dereferences, buffer overruns, and similar bugs that write to wrong addresses are contained — the hardware fault triggers before the driver can corrupt kernel memory.
  • Crash recovery: When a driver faults, ISLE Core can safely restart it without system panic because driver memory is isolated from core state.
  • Fault propagation containment: A bug in one Tier 1 driver cannot corrupt data belonging to other drivers or to ISLE Core.

What MPK does NOT protect against:

  • Deliberate exploitation: An attacker who achieves arbitrary code execution within a Tier 1 driver can execute WRPKRU to escape isolation. The instruction is unprivileged by design and the sanctioned switch_domain() trampoline uses it legitimately — it cannot be detected or blocked.
  • Runtime code injection: JIT code or ROP gadgets that contain WRPKRU can execute the instruction directly.

Driver signing — All Tier 1 drivers must be signed (Section 22). An attacker cannot load a malicious driver binary without a valid signature. The attack surface is limited to exploiting bugs in legitimately signed driver code. Combined with Rust's memory safety guarantees and standard Linux hardening (CFI, CET), this raises the bar for achieving arbitrary code execution, but does not eliminate the WRPKRU escape vector.

Tier 2 for exploitation-sensitive workloads — For environments where defense against compromised Ring 0 code is required, drivers should run at Tier 2 (full process isolation). The auto-demotion mechanism (Section 7.10.2) allows administrators to pin specific drivers to Tier 2 via policy, trading higher I/O latency for stronger isolation.

PKRU Write Elision

The ~23-cycle WRPKRU cost is per instruction, not per domain crossing. When an I/O path traverses multiple domains in sequence (e.g., NIC driver → TCP stack → socket layer), a naive implementation issues a WRPKRU at every boundary — 6 writes for a 3-boundary round-trip. UnderBridge (Gu et al., USENIX ATC '20) demonstrated that many of these writes are redundant and can be elided.

When a WRPKRU can be skipped:

  1. Same-permission transition: if domain A and domain B both need read access to a shared buffer, and the only permission change is adding write access to B's private region, the WRPKRU write may be unnecessary if A's private region is already read-disabled. The key insight: WRPKRU sets all 16 domain permissions simultaneously — if the new permission bitmap happens to be identical to the current one, the write is redundant.

  2. Batched transitions: when crossing A → B → C in rapid succession (e.g., NIC driver → TCP → socket), instead of writing PKRU three times (disable A/enable B, disable B/enable C), compute the final PKRU state and write once. The intermediate states are unnecessary if no untrusted code executes between transitions.

  3. Cached PKRU shadow: maintain a per-CPU shadow of the current PKRU value. Before issuing WRPKRU, compare the desired value against the shadow. If identical, skip the instruction entirely. This is a single register comparison (~1 cycle) versus the ~23-cycle WRPKRU.

ISLE implementation — the domain-switch trampoline maintains a per-CPU pkru_shadow variable. The switch_domain() inline function:

```rust
#[inline(always)]
fn switch_domain(target_pkru: u32) {
    // Preemption is disabled on this path: the trampoline is entered via
    // the MPK ring buffer dispatch, which disables preemption before
    // crossing the domain boundary. This ensures the shadow read, WRPKRU,
    // and shadow write are atomic with respect to scheduling. On context
    // switch, the scheduler saves/restores PKRU and the shadow together
    // (see Section 14.6, XSAVE/XRSTOR manages PKRU as part of extended register state).
    let shadow = per_cpu::pkru_shadow();
    if shadow != target_pkru {
        // SAFETY: WRPKRU updates permission bits for all 16 MPK domains.
        // target_pkru is computed from the domain allocation table and
        // validated at driver load time — only valid permission sets are
        // reachable. Preemption is disabled (see above).
        unsafe { x86::wrpkru(target_pkru) };
        per_cpu::set_pkru_shadow(target_pkru);
    }
}
```

Expected savings — on a typical TCP receive path (4 WRPKRU instructions in the naive case: 2 boundary crossings × 2 switches each), PKRU elision reduces this to 2-3 writes (the intermediate transitions where permissions don't actually change are skipped). At ~23 cycles per elided write, this saves ~23-46 cycles per packet — reducing TCP path overhead from ~2% to ~1-1.5%.

MPK on Other Architectures

  • aarch64: ARM Memory Domains (up to 16 domains via DACR on ARMv7) are not available on ARMv8/AArch64 in the same form. ISLE uses ARM's Permission Overlay Extension (POE, ARM FEAT_S1POE, optional from ARMv8.9/ARMv9.4) where available, or falls back to page-table-based domain isolation with ASID-preserving switches. POE provides 8 overlay indices (3 bits from PTE bits [62:60]), with index 0 reserved, giving 7 usable domains — fewer than x86 MPK's 12, so domain grouping is more aggressive on AArch64. Note: current datacenter ARM cores (Neoverse V2/V3, ARMv9.0-9.2) do not implement POE; the page-table fallback is the default path for current ARM servers.
  • armv7: ARMv7 provides hardware Domain Access Control via the DACR register, supporting 16 memory domains (15 usable — domain 0 reserved for kernel). Each domain can be set to No Access, Client (checked against page permissions), or Manager (unchecked access) via a single MCR instruction to update DACR. This is the closest hardware analogue to x86 MPK on 32-bit ARM — a single privileged (MCR p15) register write switches domain permissions without TLB flushes. Unlike x86 WRPKRU (which is unprivileged and executable from Ring 3), DACR writes require PL1 — this is a security advantage: user-space code cannot forge domain switches.
  • riscv64: RISC-V does not yet have an MPK equivalent. The RISC-V SPMP extension (S-mode Physical Memory Protection) provides per-hart physical memory access control, but SPMP is only active when paging is disabled (satp.mode == Bare) and thus cannot be used for Tier 1 isolation in a kernel with virtual memory enabled. The separate Smmtt extension (Supervisor Domain Access Protection) targets multi-tenant isolation but is designed for confidential computing, not MPK-style fast domain switching. In practice, RISC-V Tier 1 isolation uses page-table-based isolation with careful ASID management. This is the only viable mechanism until a true MPK equivalent is ratified for RISC-V.
  • ppc32: PPC32 uses segment registers for memory domain isolation. The 32-bit PowerPC architecture provides 16 segment registers (SR0–SR15), each controlling access to a 256 MB virtual address region. Updating a segment register via mtsr is a single supervisor-mode instruction with low overhead (~10-30 cycles). When segments are insufficient, ISLE falls back to page-table-based isolation.
  • ppc64le: PPC64LE on POWER9+ uses the Radix MMU with partition table entries (process table / PID) for isolation. On POWER8, the Hashed Page Table (HPT) with LPAR (Logical Partitioning) provides hardware-assisted isolation. The Radix MMU's PID-based isolation switches via mtspr PIDR (~30-60 cycles). HPT fallback uses full page table switches (~200-400 cycles).
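The per-architecture mechanisms above can sit behind a single backend interface. A minimal sketch, with hypothetical names (`IsolationBackend`, `PageTableFallback`) that may differ from the real ISLE abstraction:

```rust
// Illustrative sketch, not ISLE code.

type DomainId = u8;

trait IsolationBackend {
    /// Mechanism name for boot logs, e.g. "MPK", "DACR", "page-table".
    fn name(&self) -> &'static str;
    /// Usable domains after infrastructure reservations (None = unlimited).
    fn usable_domains(&self) -> Option<usize>;
    /// Rough per-switch cost in cycles, consumed by the adaptive policy.
    fn switch_cost_cycles(&self) -> u32;
    /// Make `target` the active protection domain.
    fn switch_to(&self, target: DomainId);
}

/// The universal fallback: available on every architecture, never fast.
struct PageTableFallback;

impl IsolationBackend for PageTableFallback {
    fn name(&self) -> &'static str { "page-table" }
    fn usable_domains(&self) -> Option<usize> { None } // unlimited
    fn switch_cost_cycles(&self) -> u32 { 400 }        // midpoint of ~200-500
    fn switch_to(&self, _target: DomainId) {
        // Real implementation: swap the address-space root (CR3 / TTBR0 /
        // satp) with ASID tagging to avoid a full TLB flush.
    }
}
```

Each architecture would contribute its fast backend (MPK, POE, DACR, segments, Radix PID) when the hardware supports it, with `PageTableFallback` always present as the backstop.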

Per-Architecture Isolation Cost Analysis

The x86_64 MPK WRPKRU instruction provides ~23-cycle domain switches. Other architectures use different mechanisms with different cost profiles:

| Architecture | Mechanism | Domain Switch Cost | Domains | Notes |
|---|---|---|---|---|
| x86_64 | MPK (WRPKRU) | ~23 cycles | 12 for drivers | 16 total keys. PKEY 0 (core), 1 (shared descriptors), 14 (shared DMA), 15 (guard) reserved for infrastructure. 12 keys (PKEY 2-13) available for Tier 1 driver domains. See Section 5 for details. |
| x86_64 (fallback) | Page table switch + ASID | ~200-400 cycles | Unlimited | Used when MPK unavailable (pre-Skylake). Full CR3 write + TLB management. |
| aarch64 + POE | MSR POR_EL0 + ISB | ~40-80 cycles | 7 usable | POE (Permission Overlay Extension, ARM FEAT_S1POE, optional from ARMv8.9/ARMv9.4) modifies the permission overlay register. POR_EL0 is accessible from EL0, like x86 WRPKRU. 8 overlay indices (3 bits from PTE [62:60]), index 0 reserved = 7 usable. Fewer domains than x86 MPK (7 vs 12), requiring more aggressive driver grouping. ISB barrier required (~20-40 cycles). Total ~2-4x MPK cost. Not available on current datacenter cores (Neoverse V2/V3 are ARMv9.0-9.2). |
| aarch64 + MTE | (not viable for domain isolation) | N/A | N/A | MTE (FEAT_MTE2, ARMv8.5) assigns 4-bit tags per 16-byte granule, but tags are compared per-pointer — there is no single-register switch to change which tag values are "valid" for the running domain. Retagging memory or pointers on every domain switch is impractical. MTE is valuable for memory safety (use-after-free, buffer overflow detection) but not for MPK-style domain isolation. |
| aarch64 (fallback) | Page table switch + ASID | ~150-300 cycles | Unlimited | TTBR0_EL1 write + ISB + TLBI ASIDE1IS. ASID avoids a full TLB flush but the barrier cost is significant. |
| armv7 | DACR (MCR p15) | ~10-20 cycles | 15 usable | A single MCR p15, 0, Rd, c3, c0, 0 writes all 16 domain permissions. No barrier required — DACR updates take effect on the next memory access. Comparable to MPK cost. |
| armv7 (fallback) | Page table switch + CONTEXTIDR | ~150-300 cycles | Unlimited | MCR to TTBR0 + ISB + TLBI. Similar cost profile to the aarch64 fallback. Used only if DACR domains are exhausted. |
| riscv64 | (no fast mechanism available) | N/A | N/A | SPMP only works with paging disabled (satp.mode == Bare); Smmtt targets confidential computing, not domain switching. No MPK equivalent exists for RISC-V with paging enabled. Page-table fallback is the only option. |
| riscv64 (fallback) | Page table switch + ASID | ~200-500 cycles | Unlimited | csrw satp + sfence.vma. Cost varies significantly across implementations (SiFive U74: ~200; other cores: up to ~500). |
| ppc32 | Segment registers (mtsr) | ~10-30 cycles | 15 usable | Single mtsr instruction per 256 MB segment. No barrier required. Comparable to armv7 DACR cost. |
| ppc32 (fallback) | Page table switch | ~200-400 cycles | Unlimited | Full TLB invalidation + page table base update. Similar to other fallback paths. |
| ppc64le (Radix) | PID switch (mtspr PIDR) | ~30-60 cycles | Process-table scoped | POWER9+ Radix MMU. mtspr PIDR + isync. ~2-3x MPK cost. |
| ppc64le (HPT) | HPT + LPAR switch | ~200-400 cycles | Unlimited | POWER8 Hashed Page Table. tlbie + table update. Used as fallback on pre-Radix hardware. |

Note on domain counts: The "Domains" column shows raw hardware limits. On x86-64, 4 of the 16 keys are reserved for infrastructure (PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard), leaving 12 for drivers. Other architectures with fast isolation (AArch64 POE, ARMv7 DACR, PPC32 segments) will similarly require infrastructure reservations; the exact allocation will be specified as each architecture's isolation design is finalized. The 7/15/15 "usable" counts are upper bounds before infrastructure deductions.
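The x86-64 reservation scheme above can be sketched as a tiny key allocator. `PkeyAllocator` is an illustrative name, not the ISLE API; it simply hands out the 12 driver keys (PKEY 2-13) and signals exhaustion, at which point drivers must share a key (domain grouping) or demote to Tier 2:

```rust
// Illustrative sketch, not ISLE code. PKEY 0 (core), 1 (shared
// descriptors), 14 (shared DMA), and 15 (guard) are reserved, so
// only keys 2..=13 are driver-assignable.

struct PkeyAllocator {
    free: Vec<u8>, // driver-assignable keys, lowest handed out first
}

impl PkeyAllocator {
    fn new() -> Self {
        // Reversed so pop() yields 2, 3, ... 13 in ascending order.
        Self { free: (2u8..=13).rev().collect() }
    }

    /// Returns None when all 12 driver keys are taken.
    fn alloc(&mut self) -> Option<u8> {
        self.free.pop()
    }
}
```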

Impact on performance budget — The Section 4 overhead analysis uses x86_64 MPK (~23 cycles per switch, ~92 cycles per I/O round-trip). On architectures with higher switch costs:

Methodology: NVMe overhead = 4 domain switches × per-switch cost (core↔driver round-trip). TCP overhead = 4 domain switches × per-switch cost (NIC driver ↔ isle-net = 2 boundaries × 2 directions; netfilter runs inside isle-net, same MPK domain). Percentages use 1 cycle ≈ 1 ns (~1 GHz); at modern server clocks (3-5 GHz) the cycle count is the same, but the wall-clock time per switch is lower (e.g., ~23 cycles = ~5 ns at 4.5 GHz vs ~23 ns at 1 GHz). The percentage overhead relative to I/O operations is approximately the same since I/O latency is dominated by device/network time, not CPU clock speed.

| Architecture | Overhead per NVMe 4KB read | Overhead per TCP RX |
|---|---|---|
| x86_64 MPK | +1% (92 cycles / 10μs) | +2% (~92 cycles / 5μs, with NAPI batching; naive per-packet is ~17-26%, see Section 34.5) |
| aarch64 POE | +2-3% (160-320 cycles / 10μs) | +3-6% (160-320 cycles / 5μs) |
| aarch64 fallback | +6-12% (600-1200 cycles / 10μs) | +12-24% (600-1200 cycles / 5μs) |
| armv7 DACR | +0.5-1% (40-80 cycles / 10μs) | +1-2% (40-80 cycles / 5μs) |
| armv7 fallback | +6-12% (600-1200 cycles / 10μs) | +12-24% (600-1200 cycles / 5μs) |
| riscv64 (page table only) | +8-20% (800-2000 cycles / 10μs) | +16-40% (800-2000 cycles / 5μs) |
| ppc32 segments | +0.5-1% (40-120 cycles / 10μs) | +1-2% (40-120 cycles / 5μs) |
| ppc32 fallback | +8-16% (800-1600 cycles / 10μs) | +16-32% (800-1600 cycles / 5μs) |
| ppc64le Radix | +1-2% (120-240 cycles / 10μs) | +2-5% (120-240 cycles / 5μs) |
| ppc64le HPT | +8-16% (800-1600 cycles / 10μs) | +16-32% (800-1600 cycles / 5μs) |
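The methodology above reduces to one formula, sketched here as an illustrative helper (not ISLE code): overhead% = switches × cycles-per-switch / op-cycles × 100, with 1 cycle ≈ 1 ns at the assumed ~1 GHz clock, so a 10 μs NVMe read is ~10,000 cycles:

```rust
// Illustrative helper: percentage overhead of domain switching on an
// I/O operation, under the 1 cycle ≈ 1 ns convention used in the text.
fn overhead_pct(switches: u32, cycles_per_switch: u32, op_ns: u32) -> f64 {
    100.0 * f64::from(switches * cycles_per_switch) / f64::from(op_ns)
}
```

For example, 4 switches at ~23 cycles give ~0.9% on a 10 μs NVMe read and ~1.8% on a 5 μs TCP RX, matching the x86_64 MPK rows; 4 switches at ~400 cycles give ~16% on the same NVMe read, matching the fallback rows.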

For armv7 with DACR and ppc32 with segment registers, the overhead is comparable to or better than x86 MPK — these are the cheapest isolation mechanisms across all supported architectures. For aarch64 with POE and ppc64le with Radix PID, the overhead remains within the 5% budget for storage workloads and within ~5-6% for network-intensive workloads. RISC-V uses the page-table fallback exclusively (no fast isolation mechanism exists with paging enabled), which exceeds the 5% target for I/O-heavy workloads. The page-table fallback paths on all architectures are acceptable for workloads that are not I/O-bound; for I/O-heavy workloads they motivate either prioritizing hardware with fast isolation support (MPK, POE, DACR, segment registers, Radix PID) or, on fallback-only hardware, the isolation=performance mode.

ARM server reality — FEAT_S1POE is optional from ARMv8.9/ARMv9.4. Current datacenter cores are ARMv9.0-9.2: Neoverse V2 (ARMv9.0, used in AWS Graviton 4 and Google Axion) and Neoverse V3 (ARMv9.2, used in AWS Graviton 5 and Azure Cobalt 200) do not implement POE. Cortex-X4 is also ARMv9.2 — no POE. Current ARM server deployments therefore use the page-table fallback (~150-300 cycles). POE-capable server cores are expected in the ARMv9.4+ generation. Note that even when POE is available, AArch64 has only 7 usable domains (vs x86 MPK's 12), so driver domain grouping must be more aggressive on ARM. The page-table fallback overhead of ~6-12% on I/O-heavy workloads is the realistic cost for current ARM servers; this is acceptable for most datacenter workloads where compute dominates I/O.

ARMv7 reality — DACR is universally available on all ARMv7 processors (Cortex-A7 through Cortex-A17, including Raspberry Pi 2/3 in 32-bit mode). The 16-domain limit (15 usable) matches MPK exactly, and the ~10-20 cycle switch cost is the lowest of any supported architecture. ARMv7 targets embedded and IoT use cases where the 32-bit address space (4 GB max) is the primary constraint, not isolation overhead.

RISC-V reality — RISC-V has no MPK equivalent. SPMP (S-mode Physical Memory Protection) only operates with paging disabled, making it unusable for Tier 1 isolation in a kernel with virtual memory. Smmtt (Supervisor Domain Access Protection) is a separate specification targeting confidential computing use cases. RISC-V Tier 1 isolation relies exclusively on the page-table fallback, which has significant overhead for I/O-heavy workloads. The isolation=performance mode (promoting Tier 1 to Tier 0) is recommended for RISC-V I/O-heavy workloads until a fast isolation mechanism is ratified and available in hardware.

Adaptive Isolation Policy (Graceful Degradation)

ISLE targets six architectures with fundamentally different isolation capabilities. Not all hardware provides fast isolation primitives, and that is acceptable. A universal kernel must run on everything — from modern x86 servers with MPK to RISC-V boards with no hardware isolation support. The design philosophy is pragmatic:

Use the best isolation the hardware provides. When the hardware provides nothing, degrade gracefully — don't refuse to run.

This is not a compromise. It is the same approach Linux takes with every hardware feature: use it when available, fall back when not. The difference is that ISLE's fallback paths are explicit, documented, and administrator-configurable rather than invisible and fixed.

Hardware isolation landscape:

| Architecture | Fast Mechanism | Cost | Status |
|---|---|---|---|
| x86-64 | MPK (WRPKRU) | ~23 cycles | Available on Skylake+ (2015+) |
| AArch64 | POE (FEAT_S1POE) | ~40-80 cycles | Optional from ARMv8.9/ARMv9.4; not in current datacenter cores |
| ARMv7 | DACR | ~10-20 cycles | Available on all Cortex-A cores |
| PPC32 | Segment registers | ~10-30 cycles | Available on all 32-bit PPC |
| PPC64LE | Radix PID | ~30-60 cycles | Available on POWER9+ |
| RISC-V 64 | None (page-table only) | ~200-500 cycles | No fast mechanism exists |

When the hardware lacks fast isolation primitives (MPK on x86, POE on AArch64, DACR on ARMv7, segment registers on PPC32, Radix PID on PPC64LE), ISLE must choose between isolation and performance. RISC-V always uses page-table fallback since no fast isolation mechanism exists with paging enabled. Rather than making a one-size-fits-all decision, ISLE provides an adaptive isolation policy with three modes, selectable at boot time via kernel command line or at runtime via sysfs:

Mode 1: isolation=strict (default on hardware with MPK/POE/DACR/segments/Radix)

All Tier 1 drivers run in isolated domains using the hardware-native fast mechanism. This is the designed operating point — full isolation at ~23-80 cycle cost.

Mode 2: isolation=degraded (default on hardware without fast isolation)

The page-table fallback is used for isolation. The system logs a warning at boot:

```
isle: Hardware lacks fast isolation (MPK/POE/DACR/segments/Radix not detected)
isle: Running in degraded isolation mode — Tier 1 overhead ~200-500 cycles per crossing
isle: For full performance, use hardware with MPK/POE/DACR/segments/Radix or set isolation=performance
```

This preserves the three-tier model and all safety guarantees, but with measurably higher overhead on I/O-heavy workloads (10-40% depending on path). Acceptable for development, workstations, and workloads that are not I/O-bound.

Mode 3: isolation=performance

On hardware without fast isolation, Tier 1 drivers are promoted to Tier 0 — they run in the same protection domain as isle-core with zero boundary-crossing overhead. Performance matches Linux exactly. The system logs a prominent warning:

```
isle: isolation=performance: Tier 1 drivers running WITHOUT memory isolation
isle: Driver crashes may cause kernel panic (same as Linux monolithic behavior)
isle: IOMMU DMA fencing is still active — DMA isolation preserved
```

Key properties of performance mode:

  • IOMMU DMA fencing remains active — even without MPK memory isolation, DMA operations are still restricted to driver-allocated regions. This prevents a driver bug from DMA-corrupting arbitrary kernel memory.
  • Crash recovery is best-effort — without memory isolation, a crashing driver may corrupt isle-core state, making recovery impossible. The system attempts recovery but falls back to panic if corruption is detected.
  • Capability system still enforced — the software-level capability model remains active. A driver without CAP_NET_ADMIN still cannot access network configuration, even in performance mode. MPK enforces memory access; capabilities enforce API access. Only the memory enforcement is relaxed.
  • Security model partially degraded — a malicious driver could exploit the shared address space. This mode is appropriate for trusted environments with known drivers, not for running untrusted third-party code.

Per-driver tier pinning — administrators can override the global policy on a per-driver basis in the driver manifest:

```toml
# isle-nvme driver manifest
[driver]
name = "isle-nvme"
preferred_tier = 1
minimum_tier = 1

# Override: on hardware without MPK, run this driver as Tier 0
# instead of using the slow page-table fallback
[driver.isolation_fallback]
no_fast_isolation = "promote_tier0"   # "promote_tier0" | "page_table" | "demote_tier2"
```

Options per driver:

  • promote_tier0: run in Tier 0 (fast, no isolation) — for performance-critical drivers
  • page_table: use the page-table fallback (slow, but isolated) — default
  • demote_tier2: move to Tier 2 userspace (full process isolation, highest overhead) — for untrusted or crash-prone drivers where safety outweighs performance
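The resolution logic for these options can be sketched as a small decision function. The enum and function names below mirror the manifest values but are illustrative, not the ISLE API:

```rust
// Illustrative sketch, not ISLE code.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Placement {
    Tier0,          // in-core, no memory isolation (fast)
    Tier1Fast,      // MPK / POE / DACR / segments / Radix PID
    Tier1PageTable, // page-table fallback: isolated but slow
    Tier2,          // userspace process, full isolation
}

#[derive(Clone, Copy)]
enum NoFastIsolation {
    PromoteTier0, // "promote_tier0"
    PageTable,    // "page_table" (default)
    DemoteTier2,  // "demote_tier2"
}

fn resolve_placement(has_fast_isolation: bool, policy: NoFastIsolation) -> Placement {
    if has_fast_isolation {
        // The manifest fallback only applies when fast isolation is absent.
        return Placement::Tier1Fast;
    }
    match policy {
        NoFastIsolation::PromoteTier0 => Placement::Tier0,
        NoFastIsolation::PageTable => Placement::Tier1PageTable,
        NoFastIsolation::DemoteTier2 => Placement::Tier2,
    }
}
```

On MPK-capable hardware the fallback setting is inert; on fallback-only hardware it selects one of the three placements above.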

The macOS lesson — Apple's transition from kexts (Ring 0, no isolation) to DriverKit (userspace, full isolation) took 5 years and was driven by the observation that kernel extensions caused the majority of crashes. ISLE's approach is more nuanced: rather than a binary choice between "fast and dangerous" and "safe and slow," hardware-assisted isolation (MPK, POE, DACR, segments, Radix PID) provides a third option — "fast and safe" — on modern hardware. The adaptive isolation policy ensures ISLE remains a viable drop-in replacement on older hardware by honestly trading off isolation for performance when the hardware cannot support both simultaneously.

Isolation is one of ISLE's core capabilities, not the only one. Even on hardware without fast isolation (RISC-V in Tier 0 mode, older x86 without MPK), ISLE still provides: driver crash recovery (best-effort in Tier 0, full in Tier 2), distributed kernel primitives, heterogeneous compute management, structured observability, power budgeting, post-quantum security, live kernel evolution, and a stable driver ABI. A RISC-V server running isolation=performance loses one capability (memory isolation) but retains all of the others. This is a practical trade-off, not a design failure.


4. Performance Budget

Target: less than 5% total overhead versus Linux on macro benchmarks.

The overhead comes exclusively from protection domain crossings on I/O paths — operations that must transit between ISLE Core and Tier 1 drivers. Operations handled entirely within ISLE Core (scheduling, memory management, page faults, vDSO calls) have zero additional overhead compared to Linux. Syscall dispatch adds ~23 cycles for the MPK domain transition only when the syscall must cross into a Tier 1 driver; syscalls handled entirely within ISLE Core (e.g., mmap, mprotect, brk) pay no transition at all, and even for I/O-bound syscalls the ~23 cycles are small relative to the work done.

Per-Operation Overhead

| Operation | Linux | ISLE | Overhead | Notes |
|---|---|---|---|---|
| Syscall dispatch (I/O) | ~100 cycles | ~123 cycles | +23 cycles | Bare SYSCALL/SYSRET round-trip without KPTI or Spectre mitigations. +23 cycles for the MPK domain switch to a Tier 1 driver. With KPTI + Spectre mitigations (production default), Linux syscall entry/exit is ~700-1800 cycles. On non-Meltdown-vulnerable CPUs (Intel Ice Lake+, all AMD Zen), ISLE avoids KPTI page table switches for intra-core syscalls. On Meltdown-vulnerable CPUs (Intel Skylake through Cascade Lake), KPTI is a hardware requirement that ISLE cannot avoid -- both Linux and ISLE pay the same KPTI cost on these CPUs.^1 |
| NVMe 4KB read (total) | ~10 us | ~10.1 us | +1% | 4 MPK switches = ~92 cycles on a 10 us op |
| TCP packet RX | ~5 us | ~5.1 us | +2% | 4 MPK switches (NIC driver + isle-net, enter + exit each) = ~92 cycles + ring overhead on a 5 us op |
| Page fault (anonymous) | ~300 cycles | ~300 cycles | 0% | Handled entirely in ISLE Core |
| Context switch (minimal) | ~200 cycles | ~200 cycles | 0% | Register save/restore only (same mechanism as Linux). This is the lmbench-style minimal context switch between threads in the same address space. A full process context switch with TLB flush and cache effects costs 5,000-20,000 cycles on Linux; ISLE's intra-tier switches (MPK domain change) avoid this by not changing address spaces. |
| vDSO (clock_gettime) | ~25 cycles | ~25 cycles | 0% | Mapped directly into user space |
| epoll_wait (ready) | ~80 cycles | ~80 cycles | 0% | Handled in ISLE Core |
| mmap (anonymous) | ~400 cycles | ~400 cycles | 0% | Handled in ISLE Core |

^1 KPTI note: Kernel Page Table Isolation is required on x86 CPUs vulnerable to Meltdown (Intel client/server cores from Nehalem through Cascade Lake). On these CPUs, every kernel entry/exit pays the KPTI page table switch cost (~200-1000 cycles depending on microarchitecture and TLB pressure). This is a hardware-imposed requirement, not a software choice -- ISLE cannot avoid it any more than Linux can. The "~123 cycles" figure in the table assumes a non-vulnerable CPU (Intel Ice Lake+, AMD Zen, ARM, RISC-V) or one with hardware Meltdown fixes. On Meltdown-vulnerable hardware, add the KPTI overhead to both the Linux and ISLE columns equally.

Macro Benchmark Targets

| Benchmark | Acceptable overhead vs Linux |
|---|---|
| fio randread/randwrite 4K QD32 | < 2% |
| iperf3 TCP throughput | < 5% |
| nginx small-file HTTP | < 5% |
| sysbench OLTP | < 5% |
| hackbench | < 3% |
| lmbench context switch | < 1% |
| Kernel compile (make -jN) | < 5% |

Where the Overhead Comes From

The overhead budget is dominated by I/O paths that cross the MPK boundary. Pure compute workloads, memory-intensive workloads, and scheduling-intensive workloads have effectively zero overhead because they stay entirely within ISLE Core.

Worst case: a micro-benchmark that issues millions of tiny I/O operations per second (e.g., 4K random IOPS at maximum queue depth). Even here, the ~92 cycles of MPK overhead per operation is less than 1% of the approximately 10 us total device latency.

Comprehensive Overhead Budget

The 5% target is for macro benchmarks on x86_64 with MPK. The total overhead is the sum of all ISLE-specific costs versus a monolithic Linux kernel. This section enumerates every source of overhead so the budget can be audited:

| Source | Per-event cost (x86_64 MPK) | Frequency | Contribution to macro benchmarks |
|---|---|---|---|
| MPK domain switches | ~23 cycles per WRPKRU | 2-6 per I/O op | 1-4% on I/O-heavy workloads. 0% for pure compute. |
| IOMMU DMA mapping | 0 (same as Linux) | Per DMA op | 0% — ISLE uses the IOMMU identically to Linux. |
| KABI vtable dispatch | ~2-5 cycles (indirect call) | Per driver method call | <0.1% — indirect call vs direct call. Branch predictor hides this. |
| Capability checks | ~5-10 cycles (bit test) | Per privileged op | <0.1% — bitmask test, fully pipelined. |
| Driver state checkpointing | ~0.2-0.5 μs per checkpoint (memcpy + doorbell) | Periodic (every ~1ms) | ~0.02-0.05% — amortized over 1ms. HMAC computed asynchronously by isle-core, not on the driver hot path. |
| Scheduler (EAS + PELT) | 0 (same algorithms as Linux) | Per context switch | 0% — ISLE uses the same CFS/EEVDF + PELT as Linux. |
| Scheduler (CBS guarantee) | ~50-100 cycles | Per CBS replenishment | <0.05% — replenishment every ~1ms for CBS-enabled groups only. |
| FMA health checks | ~10-50 cycles | Per device poll (~1s) | <0.001% — background, amortized over seconds. |
| Stable tracepoints | 0 when disabled, ~20-50 cycles when enabled | Per tracepoint hit | 0% disabled. <0.1% when actively tracing. |
| islefs object bookkeeping | ~50-100 cycles | Per object create/destroy | <0.01% — object lifecycle is cold path. |
| In-kernel inference | 500-5000 cycles per invocation | Per prefetch/scheduling decision | <0.1% — invoked on slow-path decisions (page reclaim, I/O reordering), not per-I/O. Clamped by cycle watchdog. |

Workload-specific overhead estimates (x86_64 MPK):

| Workload | Dominant overhead source | Estimated total overhead |
|---|---|---|
| fio 4K random IOPS | MPK switches (4 per I/O) | ~1-2% |
| iperf3 TCP throughput | MPK switches (NIC + TCP) | ~3-5% |
| nginx small-file HTTP | MPK switches (NIC + TCP + FS) | ~3-5% |
| sysbench OLTP | MPK switches (NVMe + TCP) | ~2-4% |
| hackbench (IPC-heavy) | MPK switches (scheduler stays in core) | ~1-3% |
| Kernel compile (make -jN) | Nearly zero (CPU-bound, in-core) | <1% |
| memcached (GET-heavy) | MPK switches (NIC + TCP) | ~3-5% |
| ML training (GPU) | Nearly zero (GPU work, not CPU I/O) | <1% |

Key insight: the overhead budget is dominated by a single factor — MPK domain switch cost multiplied by the number of domain crossings per operation. All other sources combined contribute less than 0.5%. This means overhead optimization efforts should focus exclusively on reducing the number of MPK crossings per I/O operation (via batching, persistent mappings, and keeping hot-path decisions within isle-core).

Non-x86 architectures: see Section 3 "Per-Architecture Isolation Cost Analysis" for overhead estimates on the other supported architectures, where isolation switch costs differ — higher on page-table fallbacks, comparable to or lower than MPK with DACR and segment registers.