Sections 55–58 and Appendices A–E of the ISLE Architecture. For the full table of contents, see README.md.


Part XIV: Strategy and Roadmap

Ecosystem strategy, implementation phases, verification, and technical risks.


55. Driver Ecosystem Strategy

55.1 The Challenge

Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. ISLE cannot replicate this overnight.

55.2 Agentic Driver Rewrite Project

The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.

AI-assisted translation pipeline:

Input:  Linux driver C source code (GPL, ~500-5000 LOC typical)
   |
   v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
        request_irq, pci_read_config_*, etc.)
   |
   v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
   |
   v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
   |
   v
Step 4: Generate KABI driver entry point and vtable exchange
   |
   v
Output: Native Rust KABI driver

Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
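
Step 2 of the pipeline can be sketched as a lookup from Linux API names to their KABI vtable counterparts. The method names on the right are illustrative assumptions; only KernelServicesVTable itself is defined by the KABI spec:

```rust
/// Sketch of the Step 2 API mapping (Linux symbol -> KABI equivalent).
/// The KernelServicesVTable method names here are hypothetical.
fn kabi_equivalent(linux_api: &str) -> Option<&'static str> {
    match linux_api {
        "kmalloc" => Some("KernelServicesVTable::mem_alloc"),
        "kfree" => Some("KernelServicesVTable::mem_free"),
        "dma_alloc_coherent" => Some("KernelServicesVTable::dma_alloc_coherent"),
        "request_irq" => Some("KernelServicesVTable::irq_register"),
        "pci_read_config_dword" => Some("KernelServicesVTable::pci_cfg_read32"),
        // Anything unmapped is flagged for the human-review step.
        _ => None,
    }
}
```

In practice the translator would carry such a table for the full Linux driver API surface, and every unmapped call becomes a review item rather than a silent translation.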

55.3 Prioritized Driver List

These drivers cover approximately 95% of real hardware in server and desktop environments:

Priority 1 -- Cloud/VM (covers 100% of cloud deployments):

1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)

Priority 2 -- Storage (covers 99% of bare-metal storage):

5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)

Priority 3 -- Networking (covers 90% of server NICs):

7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)

Priority 4 -- Human Interface (covers desktop usability):

11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver (Phase 3/4)

Priority 5 -- Platform (covers system management):

19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)

55.4 Nvidia / Proprietary Driver Strategy

For Nvidia (the most critical proprietary driver):

  • Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (nvidia.ko)
  • ISLE provides a KABI-native implementation of this kernel interface layer
  • Nvidia's proprietary compute core links against our KABI implementation
  • This is more sustainable than binary .ko compatibility: the interface layer is small, well-defined, and stable

55.5 Community Incentive

The clean KABI SDK makes driver development significantly easier than Linux:

  • No need to track unstable internal APIs
  • Rust safety eliminates entire classes of bugs
  • Binary compatibility across kernel versions eliminates recompilation burden
  • Clear, documented interfaces reduce the learning curve

This lower barrier to entry is expected to attract contributors and vendors over time.


56. Implementation Phases

This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.

Phase 1: Foundations

Goal: Boot to a hello-world program.

  • ISLE Core: x86-64 boot (UEFI + BIOS), physical memory allocator, basic scheduler, IPC/isolation domain infrastructure
  • LinuxCompat: minimal syscalls for execve + write + exit_group
  • Tier 0 drivers: APIC, timer, serial console
  • Build system: Cargo workspace with custom target spec, linker scripts
  • CI/CD: QEMU-based boot tests on every commit
  • KABI compiler: .kabi IDL parser and Rust/C code generator (v1)

Exit criteria: A statically linked 'Hello, world!' ELF binary runs on ISLE in QEMU. The KABI compiler successfully parses a minimal .kabi IDL file and generates Rust/C stubs that compile without errors.
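
For concreteness, a minimal .kabi IDL file of the kind the v1 compiler must parse might look like the following. The syntax shown is purely illustrative, an assumption for this sketch, since the actual IDL grammar is defined by the KABI compiler specification:

```
// hypothetical block.kabi -- illustrative syntax only
interface BlockDevice version 1 {
    fn read_blocks(lba: u64, count: u32, buf: DmaBuffer) -> Result<u32>;
    fn write_blocks(lba: u64, count: u32, buf: DmaBuffer) -> Result<u32>;
    fn flush() -> Result<()>;
}
```

From such a file, the v1 compiler would emit a Rust trait plus C function-pointer declarations that both sides of the vtable exchange can compile against.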

Phase 2: Self-Hosting Shell

Goal: Run a busybox shell with basic utilities.

  • VFS layer: mount table, path resolution, file descriptor table
  • Filesystems: tmpfs, initramfs (cpio), procfs (basic), sysfs (stub)
  • Block I/O layer + VirtIO-blk driver (Tier 1)
  • Memory manager: mmap, brk, page fault handler, COW, demand paging
  • Process management: fork/clone, execve, wait, exit
  • Basic signal handling: SIGCHLD, SIGKILL, SIGTERM, SIGSEGV
  • Pipe and simple I/O

Exit criteria: Busybox shell boots; ls, cat, echo, and ps work.

Phase 3: Real Workloads

Goal: Boot systemd, run Docker containers.

  • Full syscall coverage: approximately 330+ commonly used syscalls (from a dispatch table covering ~450 total Linux syscall numbers, with uncommon/obsolete syscalls returning -ENOSYS)
  • NVMe driver (Tier 1), ext4 filesystem (read-write)
  • Network: VirtIO-net driver, e1000 driver, TCP/IP stack, socket API
  • Namespaces: all 8 types
  • Cgroups: v2 (primary) + v1 compat
  • io_uring: full implementation
  • eBPF: verifier + JIT (x86-64) + core map types + XDP/tc/kprobe programs
  • seccomp-bpf: for container runtime compatibility
  • Full signal handling: all 64 signals, sigaction, sigaltstack
  • TTY/PTY subsystem: for terminal emulators

Exit criteria: Ubuntu minimal boots with systemd, Docker runs hello-world container, iperf3 and fio benchmarks complete.
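
The -ENOSYS fallback strategy from the syscall-coverage bullet in Phase 3 can be sketched as a fixed dispatch table. The handler names and the getpid placeholder below are illustrative, not the actual ISLE dispatch code:

```rust
/// Sketch: a dispatch table covering all Linux syscall numbers, with
/// unimplemented entries defaulting to a handler that returns -ENOSYS.
const ENOSYS: i64 = -38; // Linux errno ENOSYS = 38
const NR_SYSCALLS: usize = 450;

type SyscallFn = fn(&[u64; 6]) -> i64;

fn sys_enosys(_args: &[u64; 6]) -> i64 { ENOSYS }
fn sys_getpid(_args: &[u64; 6]) -> i64 { 1234 } // placeholder handler

fn build_table() -> [SyscallFn; NR_SYSCALLS] {
    // Every slot starts as -ENOSYS; implemented syscalls overwrite theirs.
    let mut t: [SyscallFn; NR_SYSCALLS] = [sys_enosys; NR_SYSCALLS];
    t[39] = sys_getpid; // __NR_getpid = 39 on x86-64
    t
}

fn dispatch(table: &[SyscallFn; NR_SYSCALLS], nr: usize, args: &[u64; 6]) -> i64 {
    if nr < NR_SYSCALLS { table[nr](args) } else { ENOSYS }
}
```

This structure makes "full coverage" incremental: adding a syscall is a single table-slot assignment, and everything else keeps returning -ENOSYS safely.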

Phase 4: Production Ready

Goal: Drop-in replacement for specific workloads.

  • KVM hypervisor: /dev/kvm, VMX, EPT, QEMU/Firecracker support
  • Netfilter/nftables: connection tracking, NAT, Docker networking
  • LSM framework: SELinux policy engine, AppArmor profiles
  • Agentic driver rewrite: top-20 driver families ported
  • Performance tuning: reach within 5% of Linux on all target benchmarks
  • Crash recovery: full Tier 1/2 fault injection testing
  • Package: .deb and .rpm packages for Ubuntu 24.04+ and Fedora 40+
  • LTP conformance: Linux Test Project suite passing (>95% of applicable tests)

Exit criteria: ISLE boots unmodified Ubuntu 24.04 and Fedora 40, runs Docker + Kubernetes single-node, passes LTP, within 5% of Linux on benchmarks.

Phase 5: Ecosystem

Goal: Broad adoption and platform maturity.

  • ARM64 port: full Tier 1 isolation using architecture-appropriate mechanisms
  • RISC-V 64 port: same
  • PPC32 port: embedded PowerPC support with segment-register isolation
  • PPC64LE port: IBM POWER server support with Radix MMU isolation
  • Extended driver coverage: GPU acceleration (i915, amdgpu compute), WiFi, Bluetooth
  • Vendor partnerships: Nvidia KABI driver, AMD KABI driver, Intel KABI driver
  • Community driver development: SDK documentation, examples, mentorship
  • Distribution certification: RHEL, Ubuntu, SUSE official support
  • Nested virtualization: KVM-on-KVM
  • Live kernel upgrade: stop all Tier 1/2 drivers, swap core, restart drivers

Enhancement Feature Phasing

The kernel-internal enhancements described in Sections 13, 15, 22, and 39-41 have different urgency levels relative to the phases above:

| Feature | Earliest Phase | Rationale |
| --- | --- | --- |
| Unified Object Namespace (Section 41) | Phase 1-2 | Foundational — other features build on it |
| Stable Tracepoints (Section 40) | Phase 2 | Needed for debugging from the start |
| Memory Compression (Section 13) | Phase 3 | Requires mature memory manager |
| Verified Boot (Section 22) | Phase 3 | Requires bootable system to protect |
| CPU Bandwidth Guarantees (Section 15) | Phase 3-4 | Requires mature scheduler + cgroups |
| Fault Management (Section 39) | Phase 4 | Requires mature driver ecosystem reporting health |

The following table covers the implementation timeline for advanced features (Parts XI-XIV). Phase numbers align with the core kernel phases defined above. "Design-In" items (Phase 1) require data structure reservations and trait definitions but no functional implementation. Higher-phase items depend on core infrastructure being available.

| Feature | Phase | Dependencies | Design-In Cost | Notes |
| --- | --- | --- | --- | --- |
| PQC crypto abstraction (§25) | Phase 1 | None | Low | Variable-length signature fields, algorithm enum |
| Formal verification readiness (§50) | Phase 1 | None | Low | Spec annotations, design contracts |
| RT preemption model (§16) | Phase 1-2 | Scheduler | Medium | Lock design, interrupt threading |
| Hardware memory safety hooks (§19) | Phase 2 | Memory allocator | Low | Tag allocation/deallocation in slab/buddy |
| Power budgeting (§49) | Phase 3 | Scheduler, cgroups | Medium | RAPL/SCMI reading, power cgroup controller. Per-task EAS is in Section 14.5 |
| Safe kernel extensibility (§51) | Phase 3 | KABI, domain isolation | Medium | Policy vtable traits, module lifecycle |
| Confidential computing — guest (§26) | Phase 3 | Memory manager | Medium | Bounce buffers, shared/private pages |
| Confidential computing — host (§26) | Phase 4 | isle-kvm, IOMMU | Medium | SEV-SNP/TDX VM management |
| PQC algorithm implementations (§25) | Phase 3-4 | Crypto abstraction | Medium | ML-KEM, ML-DSA, hybrid mode |
| Live kernel evolution (§52) | Phase 4-5 | Extensibility | Medium | State export/import, atomic swap |
| Intent-based management (§53) | Phase 4-5 | Inference engine, cgroups | Medium | Optimization loop, intent cgroup knobs |
| SmartNIC/DPU offload (§48) | Phase 4-5 | Device registry, proxy drivers | Medium | Offload transport, DPU discovery |
| Persistent memory (§32) | Phase 4-5 | VFS, memory tiers | Medium | DAX, MAP_SYNC, CLWB fencing |
| Computational storage (§33) | Phase 5+ | AccelBase framework | Low | CSD as AccelDeviceClass |
| Unified compute topology (§54) | Phase 4-5 | AccelBase, EAS (Section 14.5), power budgeting (§49) | Low | Advisory overlay; multi-dim capacity profiles, cross-device energy |
| Unified cgroup compute.weight (§54) | Phase 5+ | Unified topology, intent optimizer (§53) | Low | Optional knob; orchestration layer over existing per-domain knobs |
| NodeTransport unification (§54.13) | Phase 5 | KernelTransport (Section 47), OffloadTransport (§48) | Medium | Merge RDMA + PCIe + NVLink + CXL into one transport abstraction |
| Peer kernel nodes (§54.13) | Phase 5+ | NodeTransport, distributed kernel (Section 47) | Low | Vendor-driven; architecture ready, adoption depends on industry |

Priority Rationale

Phase 1-2 (Design-In): PQC sizing, verification readiness, RT lock design. These cost almost nothing now but are impossible to retrofit. Design contracts and data structure sizes affect everything built on top.

Phase 3 (Real Workloads): Extensibility, power budgeting, confidential guest mode. These enable the kernel to run real workloads in modern environments (cloud, power- constrained datacenters).

Phase 4-5 (Competitive Advantage): Live evolution, intent-based management, DPU offload. These are features that Linux cannot provide due to architectural constraints. They differentiate ISLE in production environments.


56.1 Licensing Summary

| Component | IP Source | Risk |
| --- | --- | --- |
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing KABI proxy model) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |

All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.

PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any ISLE real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. ISLE's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.


56.2 Performance Impact Summary

Every feature in this document was evaluated against the constraint: "Does this make ISLE measurably slower than Linux on the same workload?"

| Feature | Hot-Path Impact vs Linux | Justification |
| --- | --- | --- |
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 14.5.12 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = Voluntary, ~1% = Full, 2-5% = Realtime |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |

Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
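
The capability check described above, a single fully pipelined bitmask test, can be sketched as follows. The type and capability names are hypothetical, not the actual ISLE capability API:

```rust
/// Sketch of a ~5-10 cycle capability check: one AND plus one compare.
#[derive(Clone, Copy)]
struct CapSet(u64);

#[repr(u64)]
#[derive(Clone, Copy)]
enum Cap {
    MapDma = 1 << 3,      // hypothetical capability bits
    RegisterIrq = 1 << 4,
}

impl CapSet {
    /// The entire privileged-operation check: a bitmask test that the
    /// CPU can fully pipeline, hence the ~0.1% overhead figure.
    #[inline(always)]
    fn has(self, cap: Cap) -> bool {
        self.0 & (cap as u64) != 0
    }
}
```

Because the check compiles to a test-and-branch on a register-resident bitmask, it adds no memory traffic on the hot path.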


56.3 Consumer and Desktop Phases (Phase 5)

Phase 5 focuses on consumer hardware support, desktop integration, and application ecosystem compatibility. These sub-phases (5a-5e) begin once Phase 4 server/cloud stability is reached.

56.4 Phase 5a: Essential Consumer Hardware

Goal: ISLE boots and runs on common Intel/AMD laptops with basic functionality.

Deliverables:

  • WiFi drivers (Intel, Realtek)
  • Bluetooth stack (HID, audio)
  • Touchpad drivers (I2C-HID, PS/2)
  • Audio (Intel HDA, USB Audio)
  • Graphics (Intel i915 modesetting, basic AMD)
  • S3 suspend/resume

56.5 Phase 5b: Consumer Power Management

Goal: Battery life competitive with existing Linux distributions.

Deliverables:

  • Power profiles (performance, balanced, battery-saver)
  • S0ix Modern Standby support
  • Per-app power attribution kernel interfaces

56.6 Phase 5c: Desktop Integration

Goal: Polished desktop experience, ready for enthusiast adoption.

Deliverables:

  • Wayland compositor support (DRM, input events)
  • Multi-monitor support (hotplug)
  • Desktop notifications (battery, network, USB events)
  • Per-app sandboxing capability primitives

56.7 Phase 5d: Broader Hardware

Goal: Support popular consumer laptops (ThinkPad, XPS, etc.).

Deliverables:

  • More WiFi chipsets (Qualcomm, Mediatek, Broadcom)
  • AMD graphics (amdgpu modesetting)
  • Thunderbolt 3/4 support
  • USB4 support
  • SATA, eMMC, SD card readers

56.8 Phase 5e: Gaming & Creative

Goal: Support gaming, content creation workloads.

Deliverables:

  • Vulkan drivers (Mesa RADV for AMD, Intel ANV)
  • Steam + Proton support
  • GPU video encode/decode (hardware acceleration)

56.9 Desktop / Laptop Performance Targets

Performance targets for ISLE running on consumer-grade desktop/laptop hardware. These are acceptance criteria for Phase 5 completion, not kernel architectural constraints — specific numbers are deployment-profile goals.

| Metric | Target |
| --- | --- |
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |

56.9.1 Validation Methodology

Battery life:

  • Side-by-side comparison with Windows 11 and Ubuntu 24.04 on the same hardware
  • Standardised web-browsing benchmark (Speedometer + video stream)
  • ISLE must match or exceed Ubuntu 24.04 battery life

Real-world validation:

  • 100+ beta testers (developer community) running ISLE as daily driver
  • 30-day soak; collect crash dumps, performance traces, battery statistics


57. Verification Strategy

57.1 Testing Layers

| Layer | Tool / Method | What it verifies |
| --- | --- | --- |
| Unit tests | cargo test (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | kabi-compat-check in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for ISLE) | Syscall fuzzing for crash/hang detection |
| Static analysis | cargo clippy, custom lints | Code quality, unsafe usage review |

57.2 Key Benchmarks

These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):

| Benchmark | What it tests | Target delta |
| --- | --- | --- |
| fio randread 4K QD32 | Block I/O fast path (IOPS) | < 2% |
| fio randwrite 4K QD32 | Block I/O write path (IOPS) | < 2% |
| fio sequential read 1M | Block I/O throughput (GB/s) | < 1% |
| iperf3 TCP throughput | Network stack throughput | < 5% |
| iperf3 TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (wrk) | Combined network + filesystem | < 5% |
| redis-benchmark | In-memory key-value (network + mem) | < 3% |
| sysbench OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| hackbench (groups=100) | Scheduler + IPC throughput | < 3% |
| lmbench lat_ctx | Context switch latency | < 1% |
| Kernel compile (make -jN) | Combined CPU + IO + scheduling | < 5% |
| stress-ng mixed | Overall system stress | < 5% |

57.3 Crash Recovery Testing

Dedicated fault injection framework:

  • Domain isolation violation: Trigger write to wrong domain from Tier 1 driver
  • Null pointer dereference: In Tier 1 and Tier 2 drivers
  • Infinite loop: Verify watchdog detects and kills Tier 1 driver
  • DMA to wrong address: Verify IOMMU blocks it
  • Driver process crash: Verify Tier 2 supervisor restarts it
  • Repeated crashes: Verify auto-demotion policy engages
  • I/O in flight during crash: Verify all pending requests complete with -EIO

Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
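
One fault-injection test case reduces to checking the four properties above against a per-tier recovery target. The report type and field names below are a hypothetical sketch, not the actual framework API:

```rust
/// Hypothetical recovery report produced by one fault-injection run.
struct RecoveryReport {
    panicked: bool,             // property (1): the system must not panic
    recovery_ms: u64,           // property (2): driver recovery time
    app_errors_retryable: bool, // property (3): apps see errors but can retry
    leaked_bytes: u64,          // property (4): no memory leaked
}

/// Pass/fail check for one run against a per-tier recovery target.
fn passes(r: &RecoveryReport, target_ms: u64) -> bool {
    !r.panicked
        && r.recovery_ms <= target_ms
        && r.app_errors_retryable
        && r.leaked_bytes == 0
}
```

Structuring the harness this way makes each injected fault (null deref, infinite loop, bad DMA, ...) share one uniform pass criterion.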

57.4 CI Pipeline

Every commit triggers:

1. cargo build --target x86_64-unknown-none
2. cargo test (host-side unit tests)
3. QEMU boot test (basic boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)

Every merge to main additionally triggers:

7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite

58. Technical Risks

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 3/5). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for isle-core, 6 for userspace, 7 for temporary/debug; per Section 58.3). See "Domain Grouping: Degraded Isolation Analysis" below. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (kernel/bpf/verifier.c core is ~20K SLOC, plus ancillary analysis passes). Start with subset of program types, expand incrementally. ISLE implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 3. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 3) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |

58.1 Risks from Advanced Features (Parts XI-XIV)

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 26.3). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 25.2). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 25). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 16.3). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 50). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | KABI vtable for OffloadTransport (Section 48). DPU-specific code is behind the same driver isolation as any Tier 1 device. Vendor-specific logic in driver, not kernel. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 32). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 54). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 52). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 53). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
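
The SignatureData::Heap variant mentioned in the PQC signature-size row can be sketched as an algorithm-agile container. The exact layout in Section 25 may differ; treat this as an assumption:

```rust
/// Sketch of an algorithm-agile signature container. Small classical
/// signatures stay inline; large PQC signatures (ML-DSA-65: 3,309
/// bytes per FIPS 204) go to the heap so IPC ring-buffer slots stay
/// small. Variant names follow the document; field layout is assumed.
enum SignatureData {
    Inline([u8; 64]), // e.g. Ed25519 (64-byte signatures)
    Heap(Vec<u8>),    // e.g. ML-DSA-65 (3,309-byte signatures)
}

impl SignatureData {
    fn len(&self) -> usize {
        match self {
            SignatureData::Inline(b) => b.len(),
            SignatureData::Heap(v) => v.len(),
        }
    }
}
```

Because capability minting is cold-path, the extra heap allocation for PQC signatures never appears on per-IPC-call latency.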

58.2 Risk Response Priority

  1. Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
  2. Syscall compatibility (High): Addressed by LTP + application test matrix
  3. eBPF complexity (High): Addressed by incremental implementation
  4. KVM integration (High): Addressed by early architectural planning
  5. TEE fragmentation (High): Addressed by trait-based abstraction
  6. RT + domain isolation interaction (High): Addressed by O(1) domain switching design
  7. Domain limit (Medium): Addressed by driver grouping policy
  8. Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks

58.3 Domain Grouping: Degraded Isolation Analysis

When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for isle-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:

What grouping preserves:

  • IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
  • Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
  • Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).

What grouping degrades:

  • Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to isle-core or other domains), but it may take down both A and B.
  • The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.

Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):

| Isolation Domain | Group | Typical Members |
| --- | --- | --- |
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |

AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for isle-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.

Note for reviewers: ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice. Index 0 is the default PTE value (per ARM architecture), leaving 7 configurable indices. Do not suggest "use 4 bits for 16 indices" — the POIndex field width is fixed by the ISA.

The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:

  • Domain 0: isle-core (default PTE value)
  • Domain 1: Shared read-only
  • Domain 2: Shared DMA buffer pool
  • Domain 3: VFS + block I/O (merged — these are tightly coupled)
  • Domain 4: Network stack
  • Domain 5: All remaining Tier 1 drivers (single shared domain)
  • Domain 6: Userspace (EL0 default)
  • Domain 7: Temporary / debug

This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical isle-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
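
The boot-time selection can be sketched as a simple threshold on the architecture's driver-domain budget. The enum and the threshold value are illustrative assumptions, not the actual selection logic:

```rust
/// Hypothetical grouping schemes selected at boot from the number of
/// Tier 1 driver domains the architecture provides.
#[derive(Debug, PartialEq)]
enum GroupingScheme {
    PerSubsystem, // x86 PKU: 12 driver domains (16 keys - 4 infrastructure)
    Merged,       // AArch64 POE: 3 driver domains (7 indices - 4 infrastructure)
}

/// Sketch of the decision made from domain_count() at boot: budgets of
/// 12 or more (x86 PKU, ARMv7 DACR: 15, PPC32 segments: 15) use the
/// per-subsystem table; smaller budgets fall back to the merged scheme.
fn select_grouping(driver_domain_count: usize) -> GroupingScheme {
    if driver_domain_count >= 12 {
        GroupingScheme::PerSubsystem
    } else {
        GroupingScheme::Merged
    }
}
```

Keeping the decision to a single threshold means new architectures only need to report their budget; no per-architecture grouping code is required.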

Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 15, PPC32 segments: 15) behave more like x86.

Monitoring — when grouping occurs, ISLE logs a warning:

  isle: isolation domain 2 shared by nvme, ahci (reduced isolation: crash in either affects both)

This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.


Appendices

Reference material, comparison tables, and open questions.


Appendix A. Licensing Model: Open Kernel License Framework (OKLF) v1.3

ISLE uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md for the full legal text). Key elements:

Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — isle-core, isle-kernel, isle-compat, isle-net, isle-vfs, isle-block, isle-kvm, tools, and boot code — is GPLv2. This ensures:

- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands

Approved Linking License Registry (ALLR): A curated, append-only list of open-source licenses approved for use with kernel code. Tiers 1-2 may link with kernel code directly (Tier 0/1 drivers). Tier 3 licenses are GPL-incompatible and may NOT link with the kernel; Tier 3 code runs exclusively as Tier 2 process-isolated drivers communicating via KABI IPC, where no linking occurs:

- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
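The tier lookup a module loader would perform can be sketched as a table over SPDX identifiers. This is an illustrative sketch only: `allr_tier` and `may_link_with_kernel` are hypothetical names, and the SPDX forms (e.g. `BSD-2-Clause` for the text's "BSD-2") are assumptions.

```rust
/// Hypothetical ALLR tier lookup keyed by SPDX identifier.
/// License lists mirror the tiers in the text above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum AllrTier {
    WeakCopyleft, // Tier 1: may link with kernel code
    Permissive,   // Tier 2: may link with kernel code
    Incompatible, // Tier 3: process isolation only, no linking
}

fn allr_tier(spdx: &str) -> Option<AllrTier> {
    match spdx {
        // EPL-2.0 additionally requires the Secondary License
        // designation check described in the note below.
        "MPL-2.0" | "LGPL-2.1" | "EPL-2.0" => Some(AllrTier::WeakCopyleft),
        "MIT" | "BSD-2-Clause" | "BSD-3-Clause" | "Apache-2.0" | "ISC" | "Zlib" => {
            Some(AllrTier::Permissive)
        }
        "CDDL-1.0" | "CDDL-1.1" | "LGPL-3.0" | "EUPL-1.2" => Some(AllrTier::Incompatible),
        _ => None, // not in the append-only registry
    }
}

/// May a module under this license load into kernel space (Tier 0/1)?
fn may_link_with_kernel(spdx: &str) -> bool {
    spdx == "GPL-2.0-only"
        || matches!(
            allr_tier(spdx),
            Some(AllrTier::WeakCopyleft | AllrTier::Permissive)
        )
}

fn main() {
    assert!(may_link_with_kernel("MIT"));
    assert!(!may_link_with_kernel("CDDL-1.0")); // Tier 3: KABI IPC only
    assert!(!may_link_with_kernel("GPL-3.0-only")); // deliberately absent from ALLR
    println!("ALLR checks ok");
}
```

An append-only registry keeps this table monotonic: new arms may be added, existing arms never change, which is what gives downstream distributors legal certainty.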

LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (LGPL-3.0 Section 1: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the ISLE kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.

EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. ISLE places EUPL-1.2 in Tier 3 (process isolation required, no linking with kernel code) as the conservative default. EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1. Without explicit relicensing, EUPL-1.2 code runs as a Tier 2 process-isolated driver communicating via KABI IPC only.

EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 §3.2. Without this designation, EPL-2.0 is GPL-incompatible. ISLE requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (§2.2) requires contributors to grant a patent license for their contributions; ISLE cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with §2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
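The load-time designation check described above can be sketched as a small classifier over module metadata. Everything here is illustrative: `ModuleMeta`, its field names, and `classify_epl_module` are hypothetical, not the real KABI loader types.

```rust
/// Hypothetical module license metadata as seen by the KABI loader.
struct ModuleMeta {
    license: &'static str,
    /// SPDX expression of the declared Secondary License, if any (EPL-2.0 §3.2).
    secondary_license: Option<&'static str>,
}

#[derive(Debug, PartialEq)]
enum LoadDecision {
    /// Eligible for Tier 0/1 (in-kernel) loading.
    KernelSpace,
    /// Must run as a Tier 2 process-isolated driver over KABI IPC.
    ProcessIsolated,
}

fn classify_epl_module(m: &ModuleMeta) -> LoadDecision {
    match (m.license, m.secondary_license) {
        // GPL-compatible only with an explicit GPLv2 Secondary License designation.
        ("EPL-2.0", Some("GPL-2.0-only")) | ("EPL-2.0", Some("GPL-2.0-or-later")) => {
            LoadDecision::KernelSpace
        }
        // Undesignated EPL-2.0 is treated as Tier 3: no linking.
        ("EPL-2.0", _) => LoadDecision::ProcessIsolated,
        // This sketch only handles EPL-2.0; default conservatively.
        _ => LoadDecision::ProcessIsolated,
    }
}

fn main() {
    let designated = ModuleMeta { license: "EPL-2.0", secondary_license: Some("GPL-2.0-only") };
    let undesignated = ModuleMeta { license: "EPL-2.0", secondary_license: None };
    assert_eq!(classify_epl_module(&designated), LoadDecision::KernelSpace);
    assert_eq!(classify_epl_module(&undesignated), LoadDecision::ProcessIsolated);
    println!("EPL-2.0 designation check ok");
}
```

Note that, as the text says, this metadata check is necessary but not sufficient: code review must still confirm the designation exists in the upstream project's license header rather than only in the module manifest.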

GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. ISLE's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (anti-tivoization in Section 6, patent retaliation in Section 11) constitute "further restrictions" that GPLv2 Section 7 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 2 process-isolated driver (same as CDDL), communicating via KABI IPC with no linking.

CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may only run as Tier 2 drivers (separate address space, separate process). Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (message passing across the process boundary), which provides the same license isolation as Linux's use of FUSE for ZFS-FUSE. In-kernel (Tier 0/1) CDDL code is NOT permitted. The KABI process boundary ensures the CDDL code and GPL code never form a single "work" in the copyright sense.

New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).

Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.

Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.

Anti-tivoization stance (OKLF Section 5): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (permitted by GPLv2 Sections 0 and 10), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.

Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope. Distributed separately in firmware/. Code running on the main CPU is NOT firmware.

Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (the "additional permissions" model is explicitly contemplated by GPLv2 §0 and §10), it has not been tested in court and constitutes a novel legal approach that should not be relied upon without independent legal review. Key risks:

1. The ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final.
2. The OKLF provides weaker anti-tivoization protection than GPLv3, which is an accepted tradeoff for GPLv2 compatibility — OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause.
3. Ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption.
4. The "additional permissions" model under GPLv2 §0/§10 is well-established in principle (e.g., GCC Runtime Library Exception, Qt commercial exception), but OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some OKLF provisions constitute "further restrictions" rather than "additional permissions," which GPLv2 §7 prohibits. This risk is mitigated by careful drafting but cannot be eliminated without judicial precedent.

ISLE should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.

KABI Driver SDK: The isle-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.

How this maps to our driver tiers:

| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |

Three ABI stability tiers (extending OKLF Section 3):

| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 2 process-isolated driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |

Appendix B. Project Structure

Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (isle-kernel, isle-core, isle-driver-sdk, isle-compat, isle-net, isle-vfs, isle-block, isle-kvm). Additional crates listed below (e.g., isle-accel, isle-cluster, drivers/) will be added as their corresponding architecture sections are implemented.

isle-kernel/
  Cargo.toml                        # Workspace root (all crates)
  ARCHITECTURE.md                   # This document

  isle-core/                        # Microkernel core
    Cargo.toml
    src/
      main.rs                       # Boot entry point (calls arch-specific init)
      cap/                          # Capability system
        mod.rs                      #   Capability types, tables, operations
        revocation.rs               #   Generation-based revocation
      mem/                          # Memory management
        phys.rs                     #   Physical page allocator (buddy)
        vmm.rs                      #   Virtual memory manager (maple tree, VMAs)
        page_cache.rs               #   Page cache (RCU radix tree)
        slab.rs                     #   Slab allocator for kernel objects
        pcid.rs                     #   PCID/ASID management
        huge.rs                     #   Huge page (THP + explicit) support
      sched/                        # Scheduler
        mod.rs                      #   Scheduler core, class dispatch
        cfs.rs                      #   CFS/EEVDF fair scheduler
        rt.rs                       #   RT FIFO/RR scheduler
        deadline.rs                 #   Deadline (EDF/CBS) scheduler
        balance.rs                  #   NUMA-aware load balancer
      ipc/                          # IPC and isolation
        mpk.rs                      #   MPK domain management, WRPKRU helpers
        ring.rs                     #   Shared-memory ring buffers
        tier2_ipc.rs                #   Cross-address-space IPC for Tier 2
      arch/                         # Architecture-specific Rust code
        mod.rs                      #   Architecture trait definitions
        x86_64/                     #   x86-64 implementation
          mod.rs
          gdt.rs                    #     GDT setup
          idt.rs                    #     IDT and interrupt dispatch
          apic.rs                   #     Local APIC driver (Tier 0)
          timer.rs                  #     HPET/TSC/APIC timer (Tier 0)
          mpk.rs                    #     MPK hardware interface
          vmx.rs                    #     VMX support for KVM
        aarch64/                    #   ARM64 implementation (phase 2+)
          mod.rs
        armv7/                      #   ARMv7 implementation (phase 2+)
          mod.rs
        riscv64/                    #   RISC-V 64 implementation (phase 2+)
          mod.rs
        ppc32/                      #   PPC32 implementation (phase 2+)
          mod.rs
        ppc64le/                    #   PPC64LE implementation (phase 2+)
          mod.rs

  isle-compat/                      # Linux syscall interface + compat shims
    Cargo.toml
    src/
      syscall/                      # ~450 syscall dispatch table
        mod.rs                      #   SyscallHandler enum, dispatch table
        process.rs                  #   fork, clone, execve, exit, wait
        file.rs                     #   open, read, write, close, ioctl
        memory.rs                   #   mmap, brk, mprotect, madvise
        network.rs                  #   socket, bind, listen, accept, connect
        time.rs                     #   clock_gettime, nanosleep, timer_*
        misc.rs                     #   getpid, getuid, uname, sysinfo
      proc/                         # /proc filesystem emulation
        mod.rs
        meminfo.rs                  #   /proc/meminfo
        cpuinfo.rs                  #   /proc/cpuinfo
        pid.rs                      #   /proc/[pid]/* (maps, status, fd, etc.)
        sys.rs                      #   /proc/sys/* (sysctl interface)
      sys/                          # /sys filesystem emulation
        mod.rs
        devices.rs                  #   /sys/devices/ device tree
        class.rs                    #   /sys/class/ device classes
        bus.rs                      #   /sys/bus/ bus enumeration
      dev/                          # /dev filesystem emulation
        mod.rs
        devtmpfs.rs                 #   devtmpfs-compatible device nodes
      signal/                       # Signal handling
        mod.rs
        delivery.rs                 #   Signal delivery to user space
        handlers.rs                 #   Default handlers, core dump
      namespace/                    # Linux namespace implementation
        mod.rs
        mnt.rs                      #   Mount namespace
        pid.rs                      #   PID namespace
        net.rs                      #   Network namespace
        user.rs                     #   User namespace
        ipc.rs                      #   IPC namespace
        uts.rs                      #   UTS namespace
        cgroup.rs                   #   Cgroup namespace
        time.rs                     #   Time namespace
      cgroup/                       # Cgroup v1/v2
        mod.rs
        v2.rs                       #   Unified hierarchy (primary)
        v1_compat.rs                #   Legacy hierarchy (compatibility)
        controllers/                #   cpu, memory, io, pids, etc.
      io_uring/                     # io_uring subsystem
        mod.rs
        ring.rs                     #   SQ/CQ ring management
        sqpoll.rs                   #   SQPOLL kernel thread
        ops.rs                      #   Operation dispatch
      lsm/                          # Linux Security Modules
        mod.rs
        hooks.rs                    #   Hook framework
        selinux.rs                  #   SELinux policy engine
        apparmor.rs                 #   AppArmor profile engine
        seccomp.rs                  #   seccomp-bpf filter
      ebpf/                         # eBPF subsystem
        mod.rs
        vm.rs                       #   eBPF virtual machine
        verifier.rs                 #   Static verifier
        jit/                        #   JIT compilers
          x86_64.rs
          aarch64.rs
          armv7.rs
          riscv64.rs
          ppc32.rs
          ppc64le.rs
        maps.rs                     #   Map types (hash, array, ringbuf, etc.)
        helpers.rs                  #   eBPF helper functions
        programs.rs                 #   Program types (XDP, tc, kprobe, etc.)

  isle-net/                         # Network stack (runs as Tier 1)
    Cargo.toml
    src/
      tcp/                          # TCP/IP implementation
      udp/                          # UDP implementation
      ip/                           # IP layer (v4 + v6)
      arp.rs                        # ARP
      icmp.rs                       # ICMP
      netfilter/                    # nftables + iptables compatibility
        mod.rs
        nft.rs                      #   nftables engine
        conntrack.rs                #   Connection tracking
        nat.rs                      #   NAT (SNAT, DNAT, masquerade)
      xdp/                          # XDP fast path
      socket.rs                     # Socket abstraction
      tunnel/                       # Tunnel protocol modules (§35)
        mod.rs                      #   TunnelDevice trait
        vxlan.rs                    #   VXLAN encap/decap
        geneve.rs                   #   Geneve encap/decap
        gre.rs                      #   GRE/GRE6
        ipip.rs                     #   IPIP/SIT
        wireguard.rs                #   WireGuard VPN
      bridge/                       # Software L2 switch (§35)
        mod.rs                      #   Bridge device, FDB, STP
        vlan.rs                     #   802.1Q VLAN filtering
      veth.rs                       # Virtual ethernet pairs
      macvlan.rs                    # macvlan/ipvlan devices
      vrf.rs                        # Virtual Routing and Forwarding

  isle-vfs/                         # Virtual filesystem layer (Tier 1)
    Cargo.toml
    src/
      mod.rs                        # VFS dispatch, mount table
      ext4/                         # ext4 filesystem
      xfs/                          # XFS filesystem
      btrfs/                        # btrfs filesystem
      tmpfs/                        # tmpfs (in-memory)
      overlayfs/                    # OverlayFS (for containers)
      dcache.rs                     # Directory entry cache

  isle-block/                       # Block I/O layer (Tier 1)
    Cargo.toml
    src/
      mod.rs                        # Block device abstraction
      scheduler.rs                  # I/O schedulers (mq-deadline, none, bfq)
      partition.rs                  # Partition table parsing (GPT, MBR)
      dm/                           # Device-mapper framework (§29)
        mod.rs                      #   DM core: target dispatch, table management
        linear.rs                   #   dm-linear
        striped.rs                  #   dm-striped
        mirror.rs                   #   dm-mirror
        crypt.rs                    #   dm-crypt (AES-XTS)
        verity.rs                   #   dm-verity
        snapshot.rs                 #   dm-snapshot (COW)
        thin.rs                     #   dm-thin-pool
      md.rs                         # MD RAID (0/1/5/6/10) superblock compat
      lvm.rs                        # LVM2 metadata reader
      recovery.rs                   # Recovery-aware volume state machine
      iscsi/                        # iSCSI block storage (§30)
        mod.rs                      #   iSCSI common: PDU parsing, session state
        initiator.rs                #   iSCSI initiator (RFC 7143)
        target.rs                   #   iSCSI target (LIO-compatible config)
        iser.rs                     #   iSER — RDMA transport for iSCSI
        chap.rs                     #   CHAP authentication
        multipath.rs                #   dm-multipath integration
      nvmeof/                       # NVMe over Fabrics (§30)
        mod.rs                      #   NVMe-oF common: capsule parsing, queue pairs
        host.rs                     #   NVMe-oF initiator (host) — connect, I/O
        target.rs                   #   NVMe-oF target (subsystem) — nvmetcli compat
        tcp.rs                      #   NVMe/TCP transport (TP 8000)
        rdma.rs                     #   NVMe/RDMA transport (TP 8001)
        discovery.rs                #   Discovery controller client/server
        ana.rs                      #   ANA multipath — asymmetric namespace access

  isle-kvm/                         # KVM hypervisor (Tier 1)
    Cargo.toml
    src/
      mod.rs                        # /dev/kvm interface
      vmx.rs                        # Intel VMX
      svm.rs                        # AMD SVM
      mmu.rs                        # Nested page tables (EPT/NPT)
      tee/                          # Confidential VM support (§26)
        sev.rs                      #   AMD SEV-SNP guest/host
        tdx.rs                      #   Intel TDX guest/host
        cca.rs                      #   ARM CCA realm management

  isle-accel/                       # AI/ML accelerator subsystem (§42)
    Cargo.toml
    src/
      mod.rs                        # AccelBase trait, device registration
      scheduler.rs                  # CBS-based accelerator scheduler
      hmm.rs                        # Heterogeneous memory management
      p2p.rs                        # Peer-to-peer DMA (PCIe, NVLink, CXL)
      inference.rs                  # In-kernel inference engine
      rdma.rs                       # RDMA and collective ops

  isle-cluster/                     # Distributed kernel (§47)
    Cargo.toml
    src/
      mod.rs                        # Cluster topology, node discovery
      transport.rs                  # KernelTransport (RDMA, CXL, TCP)
      ipc.rs                        # Distributed IPC proxy
      dsm.rs                        # Distributed shared memory
      dlm.rs                        # Distributed Lock Manager (§31a)
      global_pool.rs                # Global memory pool
      scheduler.rs                  # Cluster-wide scheduling
      caps.rs                       # Network-portable capabilities

  isle-driver-sdk/                  # Stable driver SDK
    Cargo.toml
    interfaces/                     # .kabi IDL definitions
      block_device.kabi             #   Block device interface
      net_device.kabi               #   Network device interface
      gpu_device.kabi               #   GPU device interface
      input_device.kabi             #   Input device interface
      usb_device.kabi               #   USB device interface
      char_device.kabi              #   Character device interface
      pci_device.kabi               #   PCI device interface
      platform_device.kabi          #   Platform device interface
    src/
      lib.rs                        # SDK entry point, driver registration
      abi.rs                        # Generated stable ABI types
      dma.rs                        # DMA buffer management
      mmio.rs                       # MMIO access helpers (volatile read/write)
      irq.rs                        # Interrupt handling
      ring.rs                       # Ring buffer helpers for driver use
      manifest.rs                   # Driver manifest parsing

  drivers/                          # In-tree drivers
    tier0/                          # Boot-critical (statically linked)
      apic/                         #   Local APIC + I/O APIC
      timer/                        #   PIT / HPET / TSC
      serial/                       #   Early serial console
      vga/                          #   Early VGA text console
    tier1/                          # Performance-critical (domain-isolated)
      nvme/                         #   NVMe SSD driver
      virtio_blk/                   #   VirtIO block device
      virtio_net/                   #   VirtIO network device
      virtio_gpu/                   #   VirtIO GPU
      virtio_console/               #   VirtIO console
      e1000/                        #   Intel e1000 NIC
      igb/                          #   Intel igb NIC
      ahci/                         #   AHCI/SATA controller
      ext4/                         #   ext4 driver component
    tier2/                          # Isolated (user-space process)
      usb_xhci/                     #   USB XHCI host controller
      usb_hid/                      #   USB HID (keyboard, mouse)
      usb_storage/                  #   USB mass storage
      hda_audio/                    #   Intel HDA audio
      input/                        #   Input subsystem (evdev)

  tools/
    kabi-compiler/                  # .kabi IDL -> Rust/C code generator
      Cargo.toml
      src/
        main.rs
        parser.rs                   #   IDL parser
        codegen_rust.rs             #   Rust binding generator
        codegen_c.rs                #   C binding generator
    kabi-compat-check/              # ABI compatibility CI checker
      Cargo.toml
      src/
        main.rs                     #   Diffs old vs new .kabi, rejects breaks
    isle-initramfs/                 # Initramfs builder tool
      Cargo.toml
      src/
        main.rs                     #   Packs drivers + early userspace

  arch/                             # Architecture-specific C/asm
    x86_64/
      boot/                         # UEFI/BIOS boot stub (C + asm)
        header.S                    #   Linux boot protocol header
        main.c                      #   Early C boot code
        efi_stub.c                  #   UEFI stub
      asm/
        entry.S                     #   Syscall entry/exit
        switch.S                    #   Context switch
        irq_stubs.S                 #   Interrupt stub table
      vdso/
        vdso.lds                    #   vDSO linker script
        clock_gettime.c             #   clock_gettime implementation
        getcpu.c                    #   getcpu implementation
    aarch64/
      boot/                         # ARM64 boot stub
      asm/                          # ARM64 assembly
      vdso/                         # ARM64 vDSO
    riscv64/
      boot/                         # RISC-V boot stub
      asm/                          # RISC-V assembly
      vdso/                         # RISC-V vDSO
    ppc32/
      boot/                         # PPC32 boot stub
      asm/                          # PPC32 assembly
      vdso/                         # PPC32 vDSO
    ppc64le/
      boot/                         # PPC64LE boot stub
      asm/                          # PPC64LE assembly
      vdso/                         # PPC64LE vDSO

  tests/
    abi_compat/                     # Old driver binaries for compat regression
    syscall/                        # Linux syscall conformance (LTP-based)
    driver/                         # Driver integration tests
    bench/                          # Performance regression benchmarks
    crash_recovery/                 # Fault injection + recovery verification

Appendix C. What ISLE Provides That Linux Cannot

| Feature | Linux | ISLE |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2) |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time via Rust type system: phantom type parameters encode the lock level in the type signature (e.g., `Lock<Level3>`), preventing out-of-order acquisition at compile time. See isle-core lock design (Section 16). |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking, inode_lock for VFS, cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 39) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware zpool tier (Section 13) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 15) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 40) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 22) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via islefs (Section 41) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 9) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 9) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves state on context switches involving SIMD) | Lazy XSAVE — zero cost for non-SIMD threads (Section 14.6) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 17.4) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 29) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 37) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 30) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 31) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 31a) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 23) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 24) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~50-150ms (Section 46.2.6) |
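The compile-time lock-ordering row in the table above can be sketched with phantom types. This is a minimal host-side illustration of the idea, not the Section 16 design: level names, the `Held` proof token, and `OrderedLock` are all hypothetical, and real kernel locks would not be `std::sync::Mutex`.

```rust
use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};

// Lock levels as zero-sized marker types. The ordering rule in this sketch:
// a Level2 lock may only be taken while a Level1 lock is already held.
struct Level1;
struct Level2;

/// Proof token: the holder currently owns a lock of level L.
struct Held<'a, L> {
    _marker: PhantomData<&'a L>,
}

/// A mutex tagged with a type-level lock level.
struct OrderedLock<L, T> {
    inner: Mutex<T>,
    _level: PhantomData<L>,
}

impl<T> OrderedLock<Level1, T> {
    fn new(v: T) -> Self {
        OrderedLock { inner: Mutex::new(v), _level: PhantomData }
    }
    /// A level-1 lock can always be acquired first; it yields a proof token.
    fn lock(&self) -> (MutexGuard<'_, T>, Held<'_, Level1>) {
        (self.inner.lock().unwrap(), Held { _marker: PhantomData })
    }
}

impl<T> OrderedLock<Level2, T> {
    fn new(v: T) -> Self {
        OrderedLock { inner: Mutex::new(v), _level: PhantomData }
    }
    /// A level-2 lock demands proof that a level-1 lock is held, so
    /// acquiring level 2 before level 1 simply does not type-check.
    fn lock<'a>(&'a self, _proof: &'a Held<'a, Level1>) -> MutexGuard<'a, T> {
        self.inner.lock().unwrap()
    }
}

fn main() {
    let a = OrderedLock::<Level1, i32>::new(1);
    let b = OrderedLock::<Level2, i32>::new(2);
    let (ga, token) = a.lock(); // level 1 first
    let gb = b.lock(&token); // level 2 second: compiles
    assert_eq!(*ga + *gb, 3);
    // Calling b.lock(..) without a Held<Level1> token is a compile error,
    // which is the "out-of-order acquisition prevented at compile time" claim.
    println!("ordered locking ok");
}
```

Unlike lockdep, which reports ordering violations only on debug kernels at runtime, this style rejects the misordered program before it exists as a binary; the tradeoff is that lock levels must be assigned statically.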

Appendix D. Cross-Feature Integration Map

D.1 Cross-Feature Integration Map

These features are not independent — they reinforce each other:

Formal verification (§50) ──────► Confidential computing (§26)
  Proves capability system correct     Relies on correct capability enforcement

Safe extensibility (§51) ◄──────► Live evolution (§52)
  Policy modules are hot-swappable     Evolution uses the same mechanism

Intent-based management (§53) ◄──► In-kernel inference (Section 45)
  Intent optimizer uses learned models  Models optimize for declared intents

EAS / heterogeneous CPU (Section 14.5) ◄──► Power budgeting (§49)
  EAS picks energy-optimal core         Power budget enforces watt cap

Power budgeting (§49) ◄──────► Intent-based management (§53)
  Power budget is a constraint          Intents include efficiency preference

Hardware memory safety (§19) ──────► Tier 1 driver isolation (Section 6)
  MTE catches C driver bugs             Domain isolation catches the resulting faults

Confidential computing (§26) ──────► Distributed kernel (Section 47)
  TEE-to-TEE RDMA                       DSM coherence for encrypted pages

Post-quantum crypto (§25) ──────► Distributed capabilities (Section 47.9)
  PQC signatures on capabilities        Network-portable across cluster

SmartNIC/DPU (§48) ◄──────► Distributed kernel (Section 47)
  DPU = close remote node               Same proxy driver pattern

Persistent memory (§32) ◄──────► Memory tiers (Section 43)
  Persistent memory = another tier      Managed by same PageLocationTracker

Computational storage (§33) ◄──► Accelerator framework (Section 42)
  CSD = storage accelerator             Same AccelBase vtable

Unified compute (§54) ◄──────► EAS / heterogeneous CPU (Section 14.5)
  Multi-dim capacity extends scalar      CPU capacity is a special case

Unified compute (§54) ◄──────► Accelerator scheduler (Section 42.2.4)
  Cross-device topology + energy data    Accel scheduler consumes advisory

Unified compute (§54) ◄──────► Power budgeting (§49)
  Workload profile drives throttle       Informed cross-device power decisions

Unified compute (§54) ◄──────► Intent-based management (§53)
  compute.weight feeds intent optimizer  Optimizer adjusts per-domain knobs

Unified compute (§54) ◄──────► Distributed kernel (Section 47)
  Peer kernel nodes via NodeTransport    Accelerator = close compute node

Unified compute (§54) ◄──────► SmartNIC/DPU offload (§48)
  Same convergence: device → peer node   NodeTransport unifies both transports

Distributed Lock Manager (§31a) ◄──► RDMA transport (Section 47.3)
  DLM uses RDMA CAS/Send for locks      Transport provides kernel RDMA API

Distributed Lock Manager (§31a) ◄──► Cluster membership (Section 47.11)
  DLM receives join/leave/dead events    Single heartbeat source for both

Distributed Lock Manager (§31a) ◄──► Clustered filesystems (Section 31)
  GFS2/OCFS2 use DLM for coordination   DLM lock modes map to FS operations

Distributed Lock Manager (§31a) ◄──► Driver recovery (Section 9)
  DLM in isle-core survives driver crash  No lock recovery needed on Tier 1 reload

Bootstrap Circular Dependency:

The intent optimizer (Section 53) uses in-kernel inference models (Section 45), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:

1. The intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless; no reconfiguration is needed.
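The boot-time fallback can be sketched as a simple mode switch. This is an illustrative model only; `IntentOptimizer`, `InferenceModel`, and the heuristic are assumptions for the sketch, not the ISLE §53 API.

```rust
/// Static heuristic used before inference models are loaded. Encodes the
/// hardcoded rule from the text: latency target -> raise cpu.weight by 20%.
fn static_heuristic(latency_target: bool, cpu_weight: u32) -> u32 {
    if latency_target { cpu_weight + cpu_weight / 5 } else { cpu_weight }
}

/// Placeholder for a learned model loaded by the inference engine (§45).
type InferenceModel = fn(bool, u32) -> u32;

enum OptimizerMode {
    StaticDefaults,
    Learned(InferenceModel),
}

struct IntentOptimizer {
    mode: OptimizerMode,
}

impl IntentOptimizer {
    /// At boot the optimizer starts with hardcoded heuristics.
    fn new() -> Self {
        IntentOptimizer { mode: OptimizerMode::StaticDefaults }
    }

    /// Called once the inference engine has loaded models; the transition is
    /// a single state change, so no reconfiguration of intents is needed.
    fn models_loaded(&mut self, model: InferenceModel) {
        self.mode = OptimizerMode::Learned(model);
    }

    fn cpu_weight_for(&self, latency_target: bool, current: u32) -> u32 {
        match self.mode {
            OptimizerMode::StaticDefaults => static_heuristic(latency_target, current),
            OptimizerMode::Learned(m) => m(latency_target, current),
        }
    }
}

fn main() {
    let mut opt = IntentOptimizer::new();
    // Early boot: static heuristic applies (+20%).
    assert_eq!(opt.cpu_weight_for(true, 100), 120);
    // Inference engine comes up; swap in a (toy) learned model.
    opt.models_loaded(|lat, w| if lat { w + 30 } else { w });
    assert_eq!(opt.cpu_weight_for(true, 100), 130);
}
```

Callers never observe the switch: the same `cpu_weight_for` entry point serves both modes.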

D.2 Implementation Dependency Graph

Foundation (no dependencies):
  ├── Formal verification readiness (§50) — design methodology
  ├── Post-quantum crypto abstraction (§25) — data structure sizing
  └── Real-time preemption model (§16) — lock design

Early integration:
  ├── Hardware memory safety (§19) — needs memory allocator
  ├── Power budgeting (§49) — needs scheduler
  └── Safe extensibility (§51) — needs KABI vtable mechanism

Mid integration:
  ├── Confidential computing (§26) — needs memory manager, IOMMU
  ├── Intent-based management (§53) — needs inference engine, cgroups
  └── Live evolution (§52) — needs extensibility mechanism

Late integration:
  ├── SmartNIC/DPU offload (§48) — needs proxy driver, device registry
  ├── Persistent memory (§32) — needs VFS, memory tiers
  ├── Computational storage (§33) — needs AccelBase framework
  ├── Unified compute topology (§54) — needs AccelBase, EAS (Section 14.5), power budgeting (§49)
  └── Peer kernel nodes (§54.13) — needs unified compute + distributed kernel (Section 47)

Appendix E. Open Questions

The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.

io_uring integration (affects §20.5, §26, §52, §48):

- Registered buffers in confidential computing: io_uring pre-registers DMA buffers at setup time. When a VM runs under SEV-SNP, these buffers must be in shared (unencrypted) memory. Decision needed: register-time enforcement vs. lazy conversion.
- State migration during live evolution: io_uring's SQ/CQ rings, registered files, and registered buffers constitute persistent state. The live evolution framework (§52) needs a StateSerializer for io_uring context. Decision needed: drain-and-recreate vs. in-place serialization.
- DPU submission offload: DPUs can process io_uring submission queues directly, bypassing the host CPU for network and storage operations. Decision needed: how the DPU reads SQ entries (shared memory mapping vs. DMA push) and how completions are posted back to the CQ.
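To make the drain-and-recreate option concrete, here is a toy StateSerializer for an io_uring context. The `StateSerializer` name comes from the text; `IoUringCtx`, the fields, and the byte layout are assumptions made for this sketch, not a committed design.

```rust
use std::convert::TryInto;

struct IoUringCtx {
    sq_depth: u32,              // ring geometry, fixed at setup time
    registered_files: Vec<i32>, // registered fd table
    inflight: u32,              // SQEs submitted but not yet completed
}

trait StateSerializer: Sized {
    fn serialize(&self) -> Vec<u8>;
    fn deserialize(bytes: &[u8]) -> Self;
}

impl StateSerializer for IoUringCtx {
    fn serialize(&self) -> Vec<u8> {
        // Drain-and-recreate: the rings are quiesced first, so only
        // setup-time state needs to cross the component swap.
        assert_eq!(self.inflight, 0, "quiesce before component swap");
        let mut out = Vec::new();
        out.extend_from_slice(&self.sq_depth.to_le_bytes());
        out.extend_from_slice(&(self.registered_files.len() as u32).to_le_bytes());
        for fd in &self.registered_files {
            out.extend_from_slice(&fd.to_le_bytes());
        }
        out
    }

    fn deserialize(bytes: &[u8]) -> Self {
        let sq_depth = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
        let n = u32::from_le_bytes(bytes[4..8].try_into().unwrap()) as usize;
        let registered_files = (0..n)
            .map(|i| i32::from_le_bytes(bytes[8 + 4 * i..12 + 4 * i].try_into().unwrap()))
            .collect();
        // The new component re-creates empty rings with the same geometry.
        IoUringCtx { sq_depth, registered_files, inflight: 0 }
    }
}

fn main() {
    let old = IoUringCtx { sq_depth: 256, registered_files: vec![3, 4, 7], inflight: 0 };
    let new = IoUringCtx::deserialize(&old.serialize());
    assert_eq!(new.sq_depth, 256);
    assert_eq!(new.registered_files, vec![3, 4, 7]);
}
```

The in-place alternative would additionally have to capture ring contents and in-flight completion state, which is why drain-and-recreate is the simpler of the two options.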

GPU virtualization (affects §42, §26):

- Confidential GPU VMs require that GPU VRAM is encrypted and attestable. SEV-SNP does not natively protect PCIe device memory. TDX Connect (Intel) and ARM CCA device assignment are emerging but not yet stable. Decision needed: software bounce buffer path (safe, slow) vs. hardware-assisted device encryption (fast, hardware-dependent).
- Nested virtualization with GPU passthrough: a confidential VM running a nested hypervisor that passes through a GPU adds three layers of IOMMU translation. Decision needed: whether to support this at all (performance may be prohibitive).

Testing strategy for cross-feature interactions (affects all of §26-§56):

- Combinatorial explosion: 15 features yield 105 pairwise interactions. Exhaustive testing is infeasible. Prioritized critical pairs:
  1. RT + confidential computing (latency impact of memory encryption)
  2. Power budgeting + intent optimization (conflicting objectives)
  3. MTE + DSM page migration (tag preservation across RDMA transfer)
  4. Live evolution + RT (component swap during hard-RT operation)
  5. DPU offload + confidential computing (encrypted DPU-host channel)
- Decision needed: test matrix, CI infrastructure, acceptance thresholds for each pair.
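The "105 pairwise interactions" figure is just C(15, 2) = 15 × 14 / 2. A sketch of generating the full pair matrix, with placeholder feature names:

```rust
/// Enumerate all unordered feature pairs for the interaction test matrix.
fn pairs(features: &[String]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for i in 0..features.len() {
        for j in (i + 1)..features.len() {
            out.push((features[i].clone(), features[j].clone()));
        }
    }
    out
}

fn main() {
    // Placeholder names standing in for the 15 features of §26-§56.
    let features: Vec<String> = (1..=15).map(|i| format!("feature-{}", i)).collect();
    let matrix = pairs(&features);
    assert_eq!(matrix.len(), 105); // matches the count quoted in the text
}
```

A real CI matrix would then tag the five critical pairs above as blocking and run the remainder on a rotating schedule.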

Secure boot measurement chain (affects §22, §26, §51, §52):

- Live kernel evolution (§52) replaces kernel components at runtime. Each new component must be measured into a TPM PCR (or TDX MRTD) before activation to maintain the attestation chain. Decision needed: which PCR to extend (a dedicated PCR for dynamic components vs. one shared with static measurement), and how remote attestors distinguish "legitimate evolution" from "compromised kernel."
- Policy modules (§51) loaded at runtime must also be measured. Decision needed: whether measurement is mandatory (blocks unsigned modules) or advisory (measure but allow).

CXL 3.0 fabric management (affects §47, §32, §54):

- CXL 3.0 introduces fabric-attached memory with hardware-managed coherence. Decision needed: how this integrates with the distributed kernel's software DSM protocol (§47.5). Options: CXL replaces DSM for intra-rack traffic while DSM remains for inter-rack; or DSM degrades gracefully when CXL is available.

Multi-architecture parity for advanced features (affects §18, §26-§56):

- Many features in §26-§56 specify architecture-specific hardware mechanisms (WRPKRU, SEV-SNP, and RAPL on x86-64; MTE on ARM64). ARM64, RISC-V, and PowerPC equivalents exist for some but not all. Partially addressed: §18 now includes an "Advanced Feature Architecture Parity" matrix covering 8 key features across all six architectures. Remaining decision: per-feature acceptance criteria for "software fallback" vs. "not supported" (performance thresholds, testing requirements).
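One shape the remaining decision could take: classify each (feature, architecture) pair and accept a software fallback only if it clears a performance threshold. The types and threshold policy below are illustrative assumptions, not the §18 matrix format.

```rust
enum Parity {
    /// Hardware-native support on this architecture.
    Native,
    /// Software fallback, measured as a fraction of native throughput.
    SoftwareFallback { perf_vs_native: f64 },
    /// No viable implementation on this architecture.
    NotSupported,
}

/// Accept a fallback only if it retains at least `min_ratio` of native
/// performance; otherwise the feature is reported as unsupported.
fn accept(p: &Parity, min_ratio: f64) -> bool {
    match p {
        Parity::Native => true,
        Parity::SoftwareFallback { perf_vs_native } => *perf_vs_native >= min_ratio,
        Parity::NotSupported => false,
    }
}

fn main() {
    // e.g. tag checking emulated in software at 40% of native speed.
    let emulated = Parity::SoftwareFallback { perf_vs_native: 0.40 };
    assert!(accept(&emulated, 0.25));  // passes a 25% threshold
    assert!(!accept(&emulated, 0.50)); // fails a 50% threshold
    assert!(accept(&Parity::Native, 0.99));
}
```

Per-feature thresholds would live alongside the parity matrix, so "supported" has a testable meaning on every architecture.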

eBPF verifier completeness (affects §20.4):

- The Linux eBPF verifier has had multiple privilege escalation CVEs (CVE-2021-31440, CVE-2021-3490, CVE-2022-23222) stemming from incorrect bounds tracking, register state pruning, and speculative execution handling. ISLE's clean-room Rust reimplementation must achieve equivalent safety guarantees without inheriting these bug classes.
- Decision needed: the formal verification target (the full verifier including all helper-function type checking, or only the bounds-checking core and control-flow analysis), and which subset of eBPF features to support in Phase 2 (conservative: XDP and socket filters only) vs. Phase 3 (full Linux parity including tracing, cgroup, and struct_ops programs).
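To show what "bounds-checking core" means, here is a toy interval-tracking check of the kind those CVEs broke: track a [min, max] range per register and reject any memory access that could fall outside the buffer for any value in the range. This illustrates the problem class only; it is not ISLE's verifier design.

```rust
#[derive(Clone, Copy)]
struct Range { min: i64, max: i64 }

impl Range {
    /// Interval addition with overflow checking. Unchecked arithmetic on
    /// tracked bounds is exactly the bug source behind CVE-2021-3490.
    fn add(self, other: Range) -> Option<Range> {
        Some(Range {
            min: self.min.checked_add(other.min)?,
            max: self.max.checked_add(other.max)?,
        })
    }
}

/// Verify that a `size`-byte access at `addr + offset` stays inside a buffer
/// of `len` bytes for every value the tracked range allows.
fn check_mem_access(addr: Range, offset: i64, size: i64, len: i64) -> bool {
    match addr.add(Range { min: offset, max: offset }) {
        Some(r) => r.min >= 0 && r.max.checked_add(size).map_or(false, |end| end <= len),
        None => false, // overflow while tracking bounds: reject, never guess
    }
}

fn main() {
    let idx = Range { min: 0, max: 12 }; // e.g. a masked packet offset
    assert!(check_mem_access(idx, 0, 4, 16));  // worst case ends at 16 <= 16: ok
    assert!(!check_mem_access(idx, 1, 4, 16)); // worst case ends at 17 > 16: reject
    assert!(!check_mem_access(Range { min: -1, max: 4 }, 0, 4, 16)); // may underflow
}
```

The real verifier must maintain such ranges through every ALU op, branch refinement, and state-pruning decision, which is why scoping the formal verification target matters.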

io_uring + SEV-SNP shared buffer management (affects §20.5, §26):

- io_uring's registered buffer mechanism pre-pins buffers for zero-copy I/O. Under SEV-SNP, these buffers must reside in the shared (C-bit clear) memory region for device DMA access. However, shared memory is unencrypted and visible to the host VMM.
- Decision needed: per-buffer encryption/decryption at registration time using AES-GCM (performance cost: ~1 microsecond per 4 KiB page for encrypt + MAC, acceptable for storage but potentially significant for high-IOPS NVMe workloads), or accept that io_uring registered buffers are in plaintext. The plaintext approach is acceptable for block storage (the ciphertext is on disk anyway) but not for network payloads containing secrets (TLS session keys, authentication tokens). A per-buffer policy flag (IORING_REGISTER_BUFFERS_ENCRYPTED) could let applications choose.
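A sketch of the per-buffer policy decision floated above. The `IORING_REGISTER_BUFFERS_ENCRYPTED` flag name comes from the text; the placement logic and cost figure attached to it are illustrative assumptions.

```rust
const IORING_REGISTER_BUFFERS_ENCRYPTED: u32 = 1 << 0;

#[derive(Debug)]
enum BufferPlacement {
    /// Shared (C-bit clear) pages, plaintext visible to the host VMM.
    /// Acceptable for block storage: the ciphertext is on disk anyway.
    SharedPlaintext,
    /// Shared pages whose contents are AES-GCM sealed at registration time.
    /// For network payloads carrying secrets (TLS keys, tokens).
    SharedSealed { est_cost_ns_per_page: u64 },
}

/// Decide buffer placement at io_uring registration time.
fn place_registered_buffer(sev_snp: bool, flags: u32) -> BufferPlacement {
    if sev_snp && flags & IORING_REGISTER_BUFFERS_ENCRYPTED != 0 {
        // ~1 us per 4 KiB page for encrypt + MAC (figure from the text).
        BufferPlacement::SharedSealed { est_cost_ns_per_page: 1_000 }
    } else {
        BufferPlacement::SharedPlaintext
    }
}

fn main() {
    assert!(matches!(
        place_registered_buffer(true, IORING_REGISTER_BUFFERS_ENCRYPTED),
        BufferPlacement::SharedSealed { .. }
    ));
    // Without the flag, or outside SEV-SNP, buffers stay plaintext.
    assert!(matches!(place_registered_buffer(true, 0), BufferPlacement::SharedPlaintext));
}
```

Applications that never put secrets in registered buffers simply omit the flag and pay no sealing cost.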

GPU VRAM encryption for confidential VMs (affects §42, §26):

- NVIDIA H100 supports CC (Confidential Computing) mode with hardware-encrypted VRAM and attestable GPU firmware. AMD MI300X does not yet support VRAM encryption in SEV-SNP mode (GPU memory sits outside the encrypted memory boundary).
- Decision needed: a software fallback path where GPU computations operate on encrypted host memory via bounce buffers (10-100x slower due to PCIe round-trips and CPU-side encryption), or marking GPU passthrough as "not confidential" on AMD platforms until hardware support lands. A third option: restrict confidential GPU workloads to inference only (model weights are public; only input/output needs encryption) and encrypt just the host-to-GPU and GPU-to-host transfer buffers.

CXL 3.0 coherence domain interaction with DSM (affects §47.5, §47.12):

- CXL 3.0 Type 3 devices provide hardware-coherent shared memory between hosts via the CXL.mem protocol with back-invalidate support. This capability overlaps with the software DSM protocol defined in Section 47.5.
- Decision needed: when a CXL 3.0 fabric is available between two nodes, should the DSM protocol defer entirely to CXL hardware coherence (simpler, lower latency at ~200 ns vs. ~5 microseconds for software DSM, but limited to CXL-connected nodes within a single rack), or should DSM provide a unified abstraction that uses CXL as a fast transport underneath (more complex, but a uniform API across CXL and non-CXL nodes)? The hybrid approach adds a transport-selection layer to DSM that routes coherence traffic over CXL when available and falls back to RDMA otherwise.
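The hybrid option's transport-selection layer reduces to a small routing decision: use CXL back-invalidate when both nodes sit on the same fabric, else fall back to software DSM over RDMA. The types below are illustrative assumptions, not the §47.5 API; the latency figures are the ones quoted above.

```rust
#[derive(Debug, PartialEq)]
enum CoherenceTransport {
    /// CXL 3.0 hardware coherence, ~200 ns, rack-local only.
    CxlBackInvalidate,
    /// Software DSM over RDMA, ~5 us, works across racks.
    RdmaSoftwareDsm,
}

/// A node is CXL-attached if it carries a fabric id.
struct Node {
    cxl_fabric: Option<u32>,
}

/// Route coherence traffic over CXL only when both nodes share a fabric.
fn select_transport(a: &Node, b: &Node) -> CoherenceTransport {
    match (a.cxl_fabric, b.cxl_fabric) {
        (Some(fa), Some(fb)) if fa == fb => CoherenceTransport::CxlBackInvalidate,
        _ => CoherenceTransport::RdmaSoftwareDsm,
    }
}

fn main() {
    let n1 = Node { cxl_fabric: Some(7) };
    let n2 = Node { cxl_fabric: Some(7) };
    let n3 = Node { cxl_fabric: None };
    assert_eq!(select_transport(&n1, &n2), CoherenceTransport::CxlBackInvalidate);
    assert_eq!(select_transport(&n1, &n3), CoherenceTransport::RdmaSoftwareDsm);
}
```

Keeping the decision per node pair (rather than global) is what lets one DSM instance span CXL-connected and RDMA-only nodes with a uniform API.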

Live evolution attestation chain (affects §52, §22):

- When a kernel component is hot-swapped via live evolution (Section 52), the TPM measurement chain (Section 22) must be updated to reflect the new component.
- Decision needed: extend a dedicated PCR (PCR 14?) with the hash of each new component as it is loaded, or re-measure the entire kernel image into a single PCR. The former is simpler and incremental but creates a growing measurement log that remote attestors must replay to verify; the latter is cleaner for attestation (a single expected PCR value per known kernel configuration) but requires knowing the full kernel composition at re-measurement time.
- A related question: should the attestation quote include a manifest of all live-evolved components and their versions, separate from PCR values?
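The "dedicated PCR plus replayable log" option can be modeled in a few lines. A real TPM uses SHA-256 for the extend operation; std's `DefaultHasher` stands in here only so the sketch runs without external crates. The point it demonstrates: the attestor replays the event log and checks that it reproduces the quoted PCR value, so any reordered or tampered entry is detected.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for SHA-256 (illustration only, not cryptographic).
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// PCR extend: new = H(old || measurement). Incremental and order-sensitive,
/// which is exactly why the log must be replayed in order.
fn extend(pcr: u64, measurement: u64) -> u64 {
    let mut buf = pcr.to_le_bytes().to_vec();
    buf.extend_from_slice(&measurement.to_le_bytes());
    toy_hash(&buf)
}

/// Attestor-side replay: fold the measurement log from the initial value.
fn replay(log: &[u64]) -> u64 {
    log.iter().fold(0, |pcr, &m| extend(pcr, m))
}

fn main() {
    // Kernel side: two components hot-swapped via live evolution,
    // each measured into the dedicated PCR as it is loaded.
    let log = vec![toy_hash(b"sched-v2"), toy_hash(b"netstack-v3")];
    let pcr = replay(&log);
    // Attestor side: replaying the same log reproduces the PCR value...
    assert_eq!(replay(&log), pcr);
    // ...and a reordered (or tampered) log does not.
    let reordered = vec![log[1], log[0]];
    assert_ne!(replay(&reordered), pcr);
}
```

This also shows the cost the text names: the log grows with every swap, and attestors must replay all of it, whereas the re-measure option keeps a single expected value at the price of knowing the full composition up front.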


This document is the canonical reference for ISLE development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.