Sections 55–58 and Appendices A–E of the ISLE Architecture. For the full table of contents, see README.md.
Part XIV: Strategy and Roadmap
Ecosystem strategy, implementation phases, verification, and technical risks.
55. Driver Ecosystem Strategy
55.1 The Challenge
Driver coverage is the single largest adoption blocker for any new kernel. Linux has thousands of drivers covering decades of hardware. ISLE cannot replicate this overnight.
55.2 Agentic Driver Rewrite Project
The key insight: all open-source Linux driver source code is available. The hardware programming logic (register sequences, DMA setup, interrupt handling) is identical regardless of kernel API. Only the kernel-facing API surface changes.
AI-assisted translation pipeline:
Input: Linux driver C source code (GPL, ~500-5000 LOC typical)
|
v
Step 1: Parse Linux kernel API calls (kmalloc, dma_alloc_coherent,
request_irq, pci_read_config_*, etc.)
|
v
Step 2: Map to KABI equivalents (KernelServicesVTable methods)
|
v
Step 3: Translate C to Rust, preserving hardware-specific logic exactly
|
v
Step 4: Generate KABI driver entry point and vtable exchange
|
v
Output: Native Rust KABI driver
Human review: Verify hardware-specific sequences are preserved
Testing: Against real hardware + QEMU virtual devices
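Step 2 of the pipeline (mapping Linux kernel API calls to KABI equivalents) can be sketched as a lookup table. This is a minimal illustration only: the function name and all the `services.*` target names are assumptions, not the actual KernelServicesVTable surface.

```rust
// Hypothetical sketch of the Step-2 API mapping. All KABI-side names
// are illustrative; the real vtable method names may differ.

/// Maps a Linux kernel API symbol to its KABI equivalent, if one exists.
fn map_linux_call(symbol: &str) -> Option<&'static str> {
    match symbol {
        "kmalloc" => Some("services.mem.alloc"),
        "kfree" => Some("services.mem.free"),
        "dma_alloc_coherent" => Some("services.dma.alloc_coherent"),
        "request_irq" => Some("services.irq.register_handler"),
        "pci_read_config_dword" => Some("services.pci.read_config32"),
        _ => None, // unmapped calls are flagged for human review
    }
}
```

Calls that fall through to `None` are exactly the ones the human-review step below must resolve by hand.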
55.3 Prioritized Driver List
These drivers cover approximately 95% of real hardware in server and desktop environments:
Priority 1 -- Cloud/VM (covers 100% of cloud deployments):
1. VirtIO block (virtio-blk)
2. VirtIO network (virtio-net)
3. VirtIO GPU (virtio-gpu)
4. VirtIO console (virtio-console)
Priority 2 -- Storage (covers 99% of bare-metal storage):
5. NVMe (universal modern SSD interface)
6. AHCI/SATA (legacy HDDs and older SSDs)
Priority 3 -- Networking (covers 90% of server NICs):
7. Intel e1000/e1000e (universal VM and consumer NIC)
8. Intel igb/ixgbe/ice (server 1G/10G/25G/100G)
9. Realtek r8169 (consumer Ethernet)
10. Mellanox mlx5 (high-performance datacenter)
Priority 4 -- Human Interface (covers desktop usability):
11. USB XHCI host controller (all modern USB)
12. USB EHCI host controller (USB 2.0 legacy)
13. USB HID (keyboard, mouse)
14. USB mass storage
15. Intel HDA audio
16. i915 (Intel integrated graphics, modesetting)
17. amdgpu (AMD graphics, modesetting)
18. UVC (USB Video Class) camera driver (Phase 3/4)
Priority 5 -- Platform (covers system management):
19. ACPI subsystem
20. PCI/PCIe enumeration and configuration
21. IOMMU (Intel VT-d, AMD-Vi)
55.4 Nvidia / Proprietary Driver Strategy
For Nvidia (the most critical proprietary driver):
- Nvidia's driver already has a clean internal abstraction layer between their proprietary GPU core and the "kernel interface layer" (nvidia.ko)
- ISLE provides a KABI-native implementation of this kernel interface layer
- Nvidia's proprietary compute core links against our KABI implementation
- This is more sustainable than binary .ko compatibility: the interface layer is small, well-defined, and stable
55.5 Community Incentive
The clean KABI SDK makes driver development significantly easier than Linux:
- No need to track unstable internal APIs
- Rust safety eliminates entire classes of bugs
- Binary compatibility across kernel versions eliminates recompilation burden
- Clear, documented interfaces reduce the learning curve
This lower barrier to entry is expected to attract contributors and vendors over time.
56. Implementation Phases
This section covers the implementation timeline for all features. The first part (Phases 1-5+) defines core kernel milestones. The Enhancement Feature Phasing and Future-Proof Feature Phasing tables below map additional features onto these same phases.
Phase 1: Foundations
Goal: Boot to a hello-world program.
- ISLE Core: x86-64 boot (UEFI + BIOS), physical memory allocator, basic scheduler, IPC/isolation domain infrastructure
- LinuxCompat: minimal syscalls for execve + write + exit_group
- Tier 0 drivers: APIC, timer, serial console
- Build system: Cargo workspace with custom target spec, linker scripts
- CI/CD: QEMU-based boot tests on every commit
- KABI compiler: .kabi IDL parser and Rust/C code generator (v1)
Exit criteria: A statically linked 'Hello, world!' ELF binary runs on ISLE in QEMU. The KABI compiler
successfully parses a minimal .kabi IDL file and generates Rust/C stubs that compile
without errors.
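The vtable exchange that the Phase 1 KABI compiler generates might take roughly this shape. This is a hedged sketch under assumed names: the struct fields, `kabi_driver_entry`, and the tiny service subset shown here are illustrative, not the actual generated code.

```rust
// Hypothetical shape of generated KABI driver glue: the kernel hands
// the driver a services vtable, the driver hands back its entry points.

/// Services the kernel provides to a driver at load time (tiny subset).
pub struct KernelServicesVTable {
    pub log: fn(&str),
}

/// Entry points the driver exposes to the kernel.
pub struct DriverVTable {
    pub probe: fn(&KernelServicesVTable) -> Result<(), i32>,
    pub remove: fn(),
}

fn probe(services: &KernelServicesVTable) -> Result<(), i32> {
    (services.log)("hello-driver: probe ok");
    Ok(())
}

fn remove() {}

/// The driver's single exported entry point: receive the kernel's
/// vtable, return the driver's own.
pub fn kabi_driver_entry(_services: &KernelServicesVTable) -> DriverVTable {
    DriverVTable { probe, remove }
}
```

Because both sides exchange plain function-pointer tables, binary compatibility reduces to keeping these struct layouts stable across kernel versions.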
Phase 2: Self-Hosting Shell
Goal: Run a busybox shell with basic utilities.
- VFS layer: mount table, path resolution, file descriptor table
- Filesystems: tmpfs, initramfs (cpio), procfs (basic), sysfs (stub)
- Block I/O layer + VirtIO-blk driver (Tier 1)
- Memory manager: mmap, brk, page fault handler, COW, demand paging
- Process management: fork/clone, execve, wait, exit
- Basic signal handling: SIGCHLD, SIGKILL, SIGTERM, SIGSEGV
- Pipe and simple I/O
Exit criteria: Busybox shell boots, ls, cat, echo, ps work.
Phase 3: Real Workloads
Goal: Boot systemd, run Docker containers.
- Full syscall coverage: approximately 330+ commonly used syscalls (from a dispatch table covering ~450 total Linux syscall numbers, with uncommon/obsolete syscalls returning -ENOSYS)
- NVMe driver (Tier 1), ext4 filesystem (read-write)
- Network: VirtIO-net driver, e1000 driver, TCP/IP stack, socket API
- Namespaces: all 8 types
- Cgroups: v2 (primary) + v1 compat
- io_uring: full implementation
- eBPF: verifier + JIT (x86-64) + core map types + XDP/tc/kprobe programs
- seccomp-bpf: for container runtime compatibility
- Full signal handling: all 64 signals, sigaction, sigaltstack
- TTY/PTY subsystem: for terminal emulators
Exit criteria: Ubuntu minimal boots with systemd, Docker runs hello-world
container, iperf3 and fio benchmarks complete.
Phase 4: Production Ready
Goal: Drop-in replacement for specific workloads.
- KVM hypervisor: /dev/kvm, VMX, EPT, QEMU/Firecracker support
- Netfilter/nftables: connection tracking, NAT, Docker networking
- LSM framework: SELinux policy engine, AppArmor profiles
- Agentic driver rewrite: top-20 driver families ported
- Performance tuning: reach within 5% of Linux on all target benchmarks
- Crash recovery: full Tier 1/2 fault injection testing
- Package: .deb and .rpm packages for Ubuntu 24.04+ and Fedora 40+
- LTP conformance: Linux Test Project suite passing (>95% of applicable tests)
Exit criteria: ISLE boots unmodified Ubuntu 24.04 and Fedora 40, runs Docker + Kubernetes single-node, passes LTP, within 5% of Linux on benchmarks.
Phase 5: Ecosystem
Goal: Broad adoption and platform maturity.
- ARM64 port: full Tier 1 isolation using architecture-appropriate mechanisms
- RISC-V 64 port: same
- PPC32 port: embedded PowerPC support with segment-register isolation
- PPC64LE port: IBM POWER server support with Radix MMU isolation
- Extended driver coverage: GPU acceleration (i915, amdgpu compute), WiFi, Bluetooth
- Vendor partnerships: Nvidia KABI driver, AMD KABI driver, Intel KABI driver
- Community driver development: SDK documentation, examples, mentorship
- Distribution certification: RHEL, Ubuntu, SUSE official support
- Nested virtualization: KVM-on-KVM
- Live kernel upgrade: stop all Tier 1/2 drivers, swap core, restart drivers
Enhancement Feature Phasing
The kernel-internal enhancements described in Sections 13, 15, 22, and 39-41 have different urgency levels relative to the phases above:
| Feature | Earliest Phase | Rationale |
|---|---|---|
| Unified Object Namespace (Section 41) | Phase 1-2 | Foundational — other features build on it |
| Stable Tracepoints (Section 40) | Phase 2 | Needed for debugging from the start |
| Memory Compression (Section 13) | Phase 3 | Requires mature memory manager |
| Verified Boot (Section 22) | Phase 3 | Requires bootable system to protect |
| CPU Bandwidth Guarantees (Section 15) | Phase 3-4 | Requires mature scheduler + cgroups |
| Fault Management (Section 39) | Phase 4 | Requires mature driver ecosystem reporting health |
The following table covers the implementation timeline for advanced features (Parts XI-XIV). Phase numbers align with the core kernel phases defined above. "Design-In" items (Phase 1) require data structure reservations and trait definitions but no functional implementation. Higher-phase items depend on core infrastructure being available.
| Feature | Phase | Dependencies | Design-In Cost | Notes |
|---|---|---|---|---|
| PQC crypto abstraction (§25) | Phase 1 | None | Low | Variable-length signature fields, algorithm enum |
| Formal verification readiness (§50) | Phase 1 | None | Low | Spec annotations, design contracts |
| RT preemption model (§16) | Phase 1-2 | Scheduler | Medium | Lock design, interrupt threading |
| Hardware memory safety hooks (§19) | Phase 2 | Memory allocator | Low | Tag allocation/deallocation in slab/buddy |
| Power budgeting (§49) | Phase 3 | Scheduler, cgroups | Medium | RAPL/SCMI reading, power cgroup controller. Per-task EAS is in Section 14.5 |
| Safe kernel extensibility (§51) | Phase 3 | KABI, domain isolation | Medium | Policy vtable traits, module lifecycle |
| Confidential computing — guest (§26) | Phase 3 | Memory manager | Medium | Bounce buffers, shared/private pages |
| Confidential computing — host (§26) | Phase 4 | isle-kvm, IOMMU | Medium | SEV-SNP/TDX VM management |
| PQC algorithm implementations (§25) | Phase 3-4 | Crypto abstraction | Medium | ML-KEM, ML-DSA, hybrid mode |
| Live kernel evolution (§52) | Phase 4-5 | Extensibility | Medium | State export/import, atomic swap |
| Intent-based management (§53) | Phase 4-5 | Inference engine, cgroups | Medium | Optimization loop, intent cgroup knobs |
| SmartNIC/DPU offload (§48) | Phase 4-5 | Device registry, proxy drivers | Medium | Offload transport, DPU discovery |
| Persistent memory (§32) | Phase 4-5 | VFS, memory tiers | Medium | DAX, MAP_SYNC, CLWB fencing |
| Computational storage (§33) | Phase 5+ | AccelBase framework | Low | CSD as AccelDeviceClass |
| Unified compute topology (§54) | Phase 4-5 | AccelBase, EAS (Section 14.5), power budgeting (§49) | Low | Advisory overlay; multi-dim capacity profiles, cross-device energy |
| Unified cgroup compute.weight (§54) | Phase 5+ | Unified topology, intent optimizer (§53) | Low | Optional knob; orchestration layer over existing per-domain knobs |
| NodeTransport unification (§54.13) | Phase 5 | KernelTransport (Section 47), OffloadTransport (§48) | Medium | Merge RDMA + PCIe + NVLink + CXL into one transport abstraction |
| Peer kernel nodes (§54.13) | Phase 5+ | NodeTransport, distributed kernel (Section 47) | Low | Vendor-driven; architecture ready, adoption depends on industry |
Priority Rationale
Phase 1-2 (Design-In): PQC sizing, verification readiness, RT lock design. These cost almost nothing now but are impossible to retrofit. Design contracts and data structure sizes affect everything built on top.
Phase 3 (Real Workloads): Extensibility, power budgeting, confidential guest mode. These enable the kernel to run real workloads in modern environments (cloud, power- constrained datacenters).
Phase 4-5 (Competitive Advantage): Live evolution, intent-based management, DPU offload. These are features that Linux cannot provide due to architectural constraints. They differentiate ISLE in production environments.
56.1 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| Confidential computing (TEE) | Hardware vendor specs (AMD SEV, Intel TDX, ARM CCA), all public | None |
| Post-quantum crypto | NIST standards (FIPS 203, 204, 205), public domain algorithms | None |
| Power budgeting | RAPL (Intel public spec), SCMI (ARM public spec), original design | None |
| Hardware memory safety | ARM MTE (public ISA), Intel LAM (public ISA) | None |
| Formal verification | Verus (MIT license), RustBelt (academic, published) | None |
| Safe extensibility | Original design (extends existing KABI vtable model) | None |
| Live kernel evolution | Theseus OS concepts (academic, published, Rice University) | None |
| Intent-based management | Original design, optimization theory (academic) | None |
| Real-time guarantees | PREEMPT_RT concepts (GPLv2, Linux mainlined), CBS (academic) | Medium — see note below |
| SmartNIC/DPU offload | Original design (extends existing KABI proxy model) | None |
| Persistent memory | DAX/PMEM specifications (SNIA, public), Linux interfaces (facts) | None |
| Computational storage | NVMe Computational Programs Command Set and Subsystem Local Memory Command Set (public, NVMe consortium, January 2024) | None |
| Unified compute model | Original design (extends existing AccelBase + EAS models) | None |
All components are either original design, based on published academic research, based on public hardware specifications, or based on NIST/industry standards. No vendor-proprietary APIs or patented algorithms.
PREEMPT_RT derivative risk: PREEMPT_RT is GPLv2 and was merged into Linux mainline (v6.12). Any ISLE real-time code derived from PREEMPT_RT implementation (as opposed to the general concepts of preemptible kernels, threaded interrupts, and priority inheritance) could carry GPLv2 obligations that conflict with OKLF's additional permissions. ISLE's RT implementation MUST be a clean-room design based on published academic literature (priority inheritance protocols: Sha, Rajkumar, Lehoczky 1990; CBS: Abeni and Buttazzo 1998; LITMUS-RT: Brandenburg 2011) and public OS design textbooks, not derived from Linux PREEMPT_RT source code. Code review must verify no Linux-derived lock conversion patterns, interrupt threading structures, or RT-specific scheduler modifications are copied.
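The priority-inheritance protocol from the cited literature (Sha, Rajkumar, Lehoczky 1990) reduces to a simple rule that can be stated in a few lines. This is a conceptual illustration of the published protocol, not ISLE's locking code; in this sketch a higher number means higher priority.

```rust
// Conceptual core of priority inheritance: a lock holder temporarily
// runs at the highest priority of any task blocked on the lock, which
// bounds the duration of priority inversion.

/// Effective priority of a lock holder given its base priority and the
/// priorities of tasks currently blocked on the lock it holds.
fn effective_priority(base: u8, blocked_waiters: &[u8]) -> u8 {
    blocked_waiters
        .iter()
        .copied()
        .fold(base, |acc, w| acc.max(w)) // inherit the highest waiting priority
}
```

When the lock is released, the holder drops back to its base priority; that restoration step is where a clean-room implementation must be careful, but the rule itself is public academic material.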
56.2 Performance Impact Summary
Every feature in this document was evaluated against the constraint: "Does this make ISLE measurably slower than Linux on the same workload?"
| Feature | Hot-Path Impact vs Linux | Justification |
|---|---|---|
| Confidential computing | 0% (same hardware, same cost) | Hardware AES engine, identical to Linux |
| Post-quantum crypto | 0% (cold-path only) | Boot/driver-load only. ML-DSA-44 verify comparable to Ed25519; ML-DSA-65 verify ~100-200 µs (cold-path only, not on hot paths) |
| Power budgeting | 0.015% (MSR reads at tick) | 600ns per 4ms tick. Invisible in any benchmark. Per-task EAS overhead: see Section 14.5.12 |
| Hardware memory safety | 0% vs Linux when enabled | Same MTE instructions, same hardware cost. Tag RAM overhead: 3.125% of DRAM (ARM MTE only) |
| Formal verification | 0.000% (compile-time) | Not in the binary |
| Safe extensibility | 0% (same as Linux sched_class) | Function pointer dispatch, same mechanism |
| Live kernel evolution | 0.000% (rare event only) | ~10μs during replacement, months between events |
| Intent-based management | ~0.00005% (background only) | 3μs per second background optimization |
| Real-time guarantees | 0% to 5% (configurable) | Same cost as Linux PREEMPT_RT when enabled. 0% = Voluntary, ~1% = Full, 2-5% = Realtime |
| SmartNIC/DPU offload | Negative (faster) | Moves work OFF host CPU |
| Persistent memory | Negative (faster) | DAX eliminates page cache copies |
| Computational storage | Negative (faster) | CSD reduces data movement |
| Unified compute model | ~0.00005% (background only) | ~4μs/sec/cgroup advisory. Submission hot path unchanged |
Target: match or exceed Linux performance for all common workloads. Most features are invisible at steady state, and several actually improve performance. Known exceptions are conscious trade-offs documented in their respective sections: RT scheduling adds 0-5% overhead for RT-class tasks (same cost as Linux PREEMPT_RT); capability checks add ~5-10 cycles per privileged operation (~0.1%, fully pipelined bitmask test); untrusted policy module isolation adds ~46 cycles per domain crossing (eliminated once the module graduates to the Core isolation domain).
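The "fully pipelined bitmask test" behind the ~5-10 cycle capability-check estimate can be illustrated as follows; the type name and bit layout are assumptions for illustration, but the check itself is a single AND plus compare.

```rust
// Sketch of a bitmask capability check: one AND and one compare,
// which is why the per-operation cost is a few pipelined cycles.

#[derive(Clone, Copy)]
struct CapSet(u64); // one bit per capability; layout illustrative

impl CapSet {
    /// True if the capability at `bit` is granted.
    fn has(self, bit: u32) -> bool {
        self.0 & (1u64 << bit) != 0
    }
}
```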
56.3 Consumer and Desktop Phases (Phase 5)
Phase 5 focuses on consumer hardware support, desktop integration, and application ecosystem compatibility. These phases begin after Phase 4 server/cloud stability.
56.4 Phase 5a: Essential Consumer Hardware
Goal: ISLE boots and runs on common Intel/AMD laptops with basic functionality.
Deliverables:
- WiFi drivers (Intel, Realtek)
- Bluetooth stack (HID, audio)
- Touchpad drivers (I2C-HID, PS/2)
- Audio (Intel HDA, USB Audio)
- Graphics (Intel i915 modesetting, basic AMD)
- S3 suspend/resume
56.5 Phase 5b: Consumer Power Management
Goal: Battery life competitive with existing Linux distributions.
Deliverables:
- Power profiles (performance, balanced, battery-saver)
- S0ix Modern Standby support
- Per-app power attribution kernel interfaces
56.6 Phase 5c: Desktop Integration
Goal: Polished desktop experience, ready for enthusiast adoption.
Deliverables:
- Wayland compositor support (DRM, input events)
- Multi-monitor support (hotplug)
- Desktop notifications (battery, network, USB events)
- Per-app sandboxing capability primitives
56.7 Phase 5d: Broader Hardware
Goal: Support popular consumer laptops (ThinkPad, XPS, etc.).
Deliverables:
- More WiFi chipsets (Qualcomm, Mediatek, Broadcom)
- AMD graphics (amdgpu modesetting)
- Thunderbolt 3/4 support
- USB4 support
- SATA, eMMC, SD card readers
56.8 Phase 5e: Gaming & Creative
Goal: Support gaming, content creation workloads.
Deliverables:
- Vulkan drivers (Mesa RADV for AMD, Intel ANV)
- Steam + Proton support
- GPU video encode/decode (hardware acceleration)
56.9 Desktop / Laptop Performance Targets
Performance targets for ISLE running on consumer-grade desktop/laptop hardware. These are acceptance criteria for Phase 5 completion, not kernel architectural constraints — specific numbers are deployment-profile goals.
| Metric | Target |
|---|---|
| Kernel boot (bootloader → login screen) | < 5 seconds |
| Resume from S3 suspend | < 2 seconds |
| Resume from S4 hibernate | < 10 seconds |
| Idle power (WiFi on, display on) | Match or exceed Ubuntu 24.04 |
| Video playback (1080p H.264) | Hardware decode; CPU < 5% |
56.9.1 Validation Methodology
Battery life:
- Side-by-side comparison with Windows 11 and Ubuntu 24.04 on the same hardware
- Standardised web-browsing benchmark (Speedometer + video stream)
- ISLE must match or exceed Ubuntu 24.04 battery life
Real-world validation:
- 100+ beta testers (developer community) running ISLE as daily driver
- 30-day soak; collect crash dumps, performance traces, battery statistics
57. Verification Strategy
57.1 Testing Layers
| Layer | Tool / Method | What it verifies |
|---|---|---|
| Unit tests | cargo test (in QEMU or host mock) | Individual subsystem correctness |
| Integration tests | Custom test harness in QEMU | Cross-subsystem interactions |
| Syscall conformance | Linux Test Project (LTP) | Syscall behavior matches Linux |
| Application testing | Boot Ubuntu minimal, Alpine | Real-world application compatibility |
| Container testing | Docker hello-world, nginx, redis | Container runtime compatibility |
| Kubernetes testing | k3s single-node | Orchestration platform compatibility |
| ABI regression | kabi-compat-check in CI | No breaking changes to KABI |
| Crash recovery | Fault injection framework | Tier 1/2 drivers recover correctly |
| Performance regression | Automated benchmarks vs Linux baseline | No unacceptable performance regression |
| Fuzzing | syzkaller (adapted for ISLE) | Syscall fuzzing for crash/hang detection |
| Static analysis | cargo clippy, custom lints | Code quality, unsafe usage review |
57.2 Key Benchmarks
These benchmarks must match Linux within 5% (measured on identical hardware, same kernel configuration, same workload parameters):
| Benchmark | What it tests | Target delta |
|---|---|---|
| fio randread 4K QD32 | Block I/O fast path (IOPS) | < 2% |
| fio randwrite 4K QD32 | Block I/O write path (IOPS) | < 2% |
| fio sequential read 1M | Block I/O throughput (GB/s) | < 1% |
| iperf3 TCP throughput | Network stack throughput | < 5% |
| iperf3 TCP latency (RR) | Network stack latency | < 5% |
| nginx small-file HTTP (wrk) | Combined network + filesystem | < 5% |
| redis-benchmark | In-memory key-value (network + mem) | < 3% |
| sysbench OLTP read-write | Database workload (IO + CPU + sched) | < 5% |
| hackbench (groups=100) | Scheduler + IPC throughput | < 3% |
| lmbench lat_ctx | Context switch latency | < 1% |
| Kernel compile (make -jN) | Combined CPU + IO + scheduling | < 5% |
| stress-ng mixed | Overall system stress | < 5% |
57.3 Crash Recovery Testing
Dedicated fault injection framework:
- Domain isolation violation: Trigger write to wrong domain from Tier 1 driver
- Null pointer dereference: In Tier 1 and Tier 2 drivers
- Infinite loop: Verify watchdog detects and kills Tier 1 driver
- DMA to wrong address: Verify IOMMU blocks it
- Driver process crash: Verify Tier 2 supervisor restarts it
- Repeated crashes: Verify auto-demotion policy engages
- I/O in flight during crash: Verify all pending requests complete with -EIO
Each test verifies: (1) the system does not panic, (2) the driver recovers within the target time, (3) applications see errors but can retry, and (4) no memory is leaked.
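The restart-then-demote policy that the "Repeated crashes" test exercises can be modeled as a small state machine. This is a toy model of the supervisor behavior described above; the names and the crash-budget value are assumptions, not the real harness API.

```rust
// Toy model of the Tier 2 supervisor crash policy: restart the driver
// on each crash until a crash budget is exhausted, then engage the
// auto-demotion policy.

#[derive(Debug, PartialEq)]
enum DriverState {
    Running,
    Demoted, // auto-demotion policy engaged
}

struct Supervisor {
    crashes: u32,
    crash_budget: u32,
    state: DriverState,
}

impl Supervisor {
    fn new(crash_budget: u32) -> Self {
        Self { crashes: 0, crash_budget, state: DriverState::Running }
    }

    /// Invoked by the fault-injection harness each time the driver crashes.
    fn on_crash(&mut self) {
        self.crashes += 1;
        if self.crashes > self.crash_budget {
            self.state = DriverState::Demoted;
        } else {
            self.state = DriverState::Running; // restarted
        }
    }
}
```

The fault-injection tests then assert the four properties above against this transition behavior: no panic, bounded recovery time, retryable errors, no leaks.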
57.4 CI Pipeline
Every commit triggers:
1. cargo build --target x86_64-unknown-none
2. cargo test (host-side unit tests)
3. QEMU boot test (basic boot + shutdown)
4. kabi-compat-check (no ABI breaks)
5. cargo clippy (lint pass)
6. cargo fmt --check (formatting)
Every merge to main additionally triggers:
7. LTP syscall conformance suite
8. Docker container boot test
9. Performance benchmark suite (vs stored Linux baseline)
10. Crash recovery fault injection suite
58. Technical Risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| MPK provides only 16 domains | Medium | Certain | Group related drivers by fault domain (all block share domain, all net share domain). 12 driver-available domains on x86 (4 keys reserved for infrastructure: PKEY 0=core, 1=shared descriptors, 14=shared DMA, 15=guard; per Section 3/5). AArch64 POE has 7 usable indices (1-7), of which 3 are available for Tier 1 driver domains (indices 3-5; indices 1-2 reserved for isle-core, 6 for userspace, 7 for temporary/debug; per Section 58.3). See "MPK Domain Grouping" below for degraded isolation analysis. |
| eBPF verifier complexity | High | High | Verifier subsystem is ~30K SLOC in Linux (kernel/bpf/verifier.c core is ~20K SLOC, plus ancillary analysis passes). Start with subset of program types, expand incrementally. ISLE implements a clean-room Rust verifier and JIT (GPL avoidance); the eBPF bytecode format and helper API are compatible with Linux but the implementation is original. |
| KVM deeply integrated with Linux MM | High | High | Design memory manager with KVM hooks from the start (Phase 1 architecture). Dedicate a team to KVM from Phase 3. |
| Driver coverage gap blocks adoption | Critical | High | Cloud-first strategy (VirtIO covers 100% of VMs). Prioritize top-20 drivers. Agentic rewrite pipeline for open-source drivers. |
| Subtle syscall compatibility bugs | High | High | LTP conformance suite, real-world application testing, syzkaller fuzzing. Build a comprehensive test matrix of applications. |
| Spectre/Meltdown mitigations + domain isolation | Medium | Medium | KPTI not needed for Tier 1 (same Ring 0). Tier 2 needs standard KPTI. Retpoline/IBRS for indirect branches. Test on affected hardware. |
| IOMMU not available on all hardware | Medium | Medium | IOMMU required for Tier 1 DMA fencing. Systems without IOMMU fall back to trusted mode (reduced isolation, logged warning). |
| ARM64 lacks direct MPK equivalent | Medium | Certain | Use POE (FEAT_S1POE, 7 usable indices of which 3 are for Tier 1 drivers, optional from ARMv8.9+) or page-table fallback. Adaptive isolation policy (Section 3) allows per-driver tier pinning or promotion to Tier 0 on pre-POE hardware. |
| No fast isolation on pre-2020 x86 | Medium | Certain | Adaptive isolation policy: isolation=performance promotes Tier 1 to Tier 0 (Linux-equivalent speed, no memory isolation). IOMMU DMA fencing still active. |
| Rust ecosystem maturity for OS dev | Low | Medium | Established patterns from Redox, Linux rust-for-linux, Hubris. Use #![no_std] and custom allocator. Unsafe blocks at hardware boundaries are expected and audited. |
| Performance target too ambitious | Medium | Medium | 5% target is for macro benchmarks. Micro-benchmarks may show higher overhead on specific paths. Batch amortization and careful profiling. |
| Community adoption / contributor pipeline | Medium | Medium | Clean SDK, good documentation, lower barrier than Linux driver development. Cloud-first focus builds credibility before desktop push. |
| Regulatory / certification barriers | Low | Low | Work with distributions early. Open-source everything except vendor proprietary blobs. |
| LZ4/Zstd kernel implementation correctness | Medium | Medium | Fuzzing, comparison with reference implementation. Use no_std BSD-licensed implementations with comprehensive test vectors. |
| Object namespace overhead on hot paths | Low | Low | Lazy registration for high-frequency objects (fds, sockets, VMAs). Eagerly registered objects only (~2000 baseline = ~384 KB). |
| CBS scheduling fairness under edge cases | Medium | Medium | Formal analysis against CBS paper (Abeni 1998), stress testing with adversarial workloads, comparison with Linux cpu.max behavior. |
58.1 Risks from Advanced Features (Parts XI-XIV)
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| TEE hardware fragmentation (SEV-SNP vs TDX vs CCA) | High | Certain | Abstract behind ConfidentialContext trait (Section 26.3). Implement one backend at a time. SEV-SNP first (largest cloud deployment), TDX second, CCA third. |
| PQC algorithm instability (NIST may revise) | Medium | Medium | Algorithm-agile abstraction (Section 25.2). Algorithms behind enum dispatch; swapping ML-KEM for a successor is a library update, not a kernel redesign. |
| PQC signature sizes impact IPC latency | Low | Certain | ML-DSA-65 signatures are 3,309 bytes (per NIST FIPS 204, Table 2). Cold-path only (capability minting, not every IPC call). SignatureData::Heap variant avoids ring buffer bloat (Section 25). |
| RT + domain isolation interaction causes priority inversion | High | Medium | Domain switch (WRPKRU on x86) is ~23 cycles (no lock needed). Domain switching is O(1) — no contention path. If priority inheritance needed for domain-shared buffers, use PI futexes (Section 16.3). |
| Formal verification scope creep | Medium | Medium | Verify only security-critical paths: capability table, IPC ring, page table mapping (Section 50). Accept that ~80% of kernel code is tested, not verified. |
| DPU vendor lock-in (proprietary firmware) | Medium | High | KABI vtable for OffloadTransport (Section 48). DPU-specific code is behind the same driver isolation as any Tier 1 device. Vendor-specific logic in driver, not kernel. |
| PMEM/CXL hardware not yet widely deployed | Low | High | Design is hardware-agnostic (Section 32). All PMEM code compiles out when hardware is absent. CXL 3.0 adoption expected 2025-2027; architecture ready, implementation deferred. |
| Unified compute model adds scheduling overhead | Medium | Low | Advisory overlay only — existing schedulers unchanged (Section 54). Topology queries are O(1) reads from cached ComputeCapacityProfile. No hot-path cost. |
| Live kernel evolution causes state corruption | Critical | Low | Post-swap watchdog with 5-second timer (Section 52). On crash, the system attempts to re-extract state from the failing component; if extraction fails, the system panics rather than reverting to stale state, preventing silent data corruption. State serialization uses versioned HMAC integrity tags. |
| Intent optimizer makes poor decisions | Low | Medium | Intent system is purely advisory (Section 53). Clamping prevents invalid resource configs. Worst case: system falls back to static defaults (no intent optimization). |
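The SignatureData::Heap mechanism referenced in the PQC signature-size row above (Section 25) might look roughly like this. The inline threshold and exact layout are assumptions; only the enum name comes from the text.

```rust
// Sketch of variable-length signature storage: classical signatures
// (e.g. Ed25519, 64 bytes) stay inline in the entry; large PQC
// signatures such as ML-DSA-65 (3,309 bytes) go to the heap so ring
// buffer entries stay small.

const INLINE_MAX: usize = 128; // illustrative threshold

enum SignatureData {
    Inline([u8; INLINE_MAX], usize), // fixed buffer + actual length
    Heap(Vec<u8>),                   // avoids bloating ring entries
}

fn store_signature(sig: &[u8]) -> SignatureData {
    if sig.len() <= INLINE_MAX {
        let mut buf = [0u8; INLINE_MAX];
        buf[..sig.len()].copy_from_slice(sig);
        SignatureData::Inline(buf, sig.len())
    } else {
        SignatureData::Heap(sig.to_vec())
    }
}
```

Since signatures appear only on cold paths (capability minting), the heap allocation in the large case has no hot-path cost.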
58.2 Risk Response Priority
- Driver coverage (Critical): Addressed by cloud-first strategy + agentic rewrite
- Syscall compatibility (High): Addressed by LTP + application test matrix
- eBPF complexity (High): Addressed by incremental implementation
- KVM integration (High): Addressed by early architectural planning
- TEE fragmentation (High): Addressed by trait-based abstraction
- RT + domain isolation interaction (High): Addressed by O(1) domain switching design
- Domain limit (Medium): Addressed by driver grouping policy
- Live evolution safety (Critical but low likelihood): Addressed by watchdog + state HMAC integrity checks
58.3 Domain Grouping: Degraded Isolation Analysis
When more than 12 Tier 1 drivers are loaded simultaneously, some drivers must share an isolation domain (protection key). This is an inherent limitation of Intel's 16-key PKU design (16 keys minus PKEY 0 for isle-core, minus PKEY 1 for shared descriptors, minus PKEY 14 for shared DMA, minus PKEY 15 as guard = 12 usable). Grouping has concrete consequences for fault isolation:
What grouping preserves:
- IOMMU isolation: each driver retains its own IOMMU domain regardless of domain grouping. DMA fencing is unaffected — a crashing NVMe driver cannot DMA into a NIC driver's buffers, even if they share an isolation domain.
- Capability isolation: each driver has its own capability set. Sharing an isolation domain does not grant access to another driver's capabilities.
- Crash detection: fault injection and page-fault trapping still identify the crashing driver (via instruction pointer, not isolation domain).
What grouping degrades:
- Memory read/write isolation between grouped drivers. If drivers A and B share isolation domain 5, a buffer overrun in A can corrupt B's data structures. The crash is still contained (it cannot escape to isle-core or other domains), but it may take down both A and B.
- The blast radius of a crash expands from one driver to one domain group. In practice, this means a faulty NVMe driver could take down the AHCI driver if both are in the "block" group.
Grouping policy — drivers are grouped by fault domain affinity (drivers that interact heavily and would likely cascade-fail anyway):
| Isolation Domain | Group | Typical Members |
|---|---|---|
| 2 | Block storage | NVMe, AHCI/SATA, virtio-blk, iSCSI, NVMe-oF |
| 3 | Network (data) | Intel NIC, Mellanox NIC, virtio-net |
| 4 | Network (stack) | TCP/IP, UDP, RDMA core |
| 5 | Filesystem | ext4, XFS, btrfs |
| 6 | Display | DRM/KMS, GPU compute |
| 7 | KVM | VMX/SVM, vhost-net, vhost-scsi |
| 8 | USB | xHCI, USB hub, USB mass storage |
| 9-13 | Individual | High-value drivers that warrant solo domains |
AArch64 domain budget: POE provides 7 usable indices (1-7; index 0 is reserved for isle-core as the default PTE value). Of the 7 usable indices, 4 are reserved for infrastructure (index 1 for shared read-only, index 2 for shared DMA, index 6 for userspace, index 7 for temporary/debug), leaving only 3 indices for Tier 1 driver domains (indices 3-5; vs. 12 on x86). AArch64 therefore requires much more aggressive grouping.
Note for reviewers: ARM FEAT_S1POE uses a 3-bit POIndex field in page table entries, providing 8 index values (0-7). This is a hardware constraint, not a design choice. Index 0 is the default PTE value (per ARM architecture), leaving 7 configurable indices. Do not suggest "use 4 bits for 16 indices" — the POIndex field width is fixed by the ISA.
The grouping table above is designed for x86's 12-domain budget. On AArch64, the kernel applies a reduced grouping scheme:
- Domain 0: isle-core (default PTE value)
- Domain 1: Shared read-only
- Domain 2: Shared DMA buffer pool
- Domain 3: VFS + block I/O (merged — these are tightly coupled)
- Domain 4: Network stack
- Domain 5: All remaining Tier 1 drivers (single shared domain)
- Domain 6: Userspace (EL0 default)
- Domain 7: Temporary / debug
This reduces isolation granularity for Tier 1 drivers on AArch64 (all share one domain) but preserves the critical isle-core/driver/userspace boundaries. The architecture-specific grouping is selected at boot based on arch::current::isolation::domain_count().
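The boot-time selection described above can be sketched as a function of the driver-domain budget (12 on x86 PKU, 3 on AArch64 POE, per the text). The enum names and the threshold are illustrative assumptions, not the actual kernel types.

```rust
// Sketch of boot-time grouping-scheme selection from the number of
// Tier 1 driver domains the architecture provides.

#[derive(Debug, PartialEq)]
enum GroupingScheme {
    PerSubsystem, // one domain per subsystem group (x86-style budget)
    Merged,       // VFS + block merged; remaining drivers share one domain
}

fn select_grouping(driver_domain_count: usize) -> GroupingScheme {
    if driver_domain_count >= 7 {
        GroupingScheme::PerSubsystem
    } else {
        GroupingScheme::Merged
    }
}
```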
Typical server scenario — a cloud server runs NVMe + NIC + TCP + KVM + virtio = 5 drivers. On x86 (12 driver domains), these fit in 5 domains with no grouping needed; the 12-domain limit only triggers on heavily-configured systems (desktop with GPU + audio + USB + Bluetooth + WiFi + NVMe + SATA + NIC + ...). On AArch64 with POE (3 driver domains), even this typical 5-driver configuration requires grouping -- the reduced scheme above merges block I/O, networking, and remaining drivers into 3 shared domains. Architectures with more domains (ARMv7 DACR: 15, PPC32 segments: 15) behave more like x86.
Monitoring — when grouping occurs, ISLE logs a warning:
isle: isolation domain 1 shared by nvme, ahci (reduced isolation: crash in either affects both)
This allows administrators to make informed decisions about which drivers to load as Tier 2 (full process isolation, unlimited domains) if they require stronger isolation than domain grouping provides.
Appendices
Reference material, comparison tables, and open questions.
Appendix A. Licensing Model: Open Kernel License Framework (OKLF) v1.3
ISLE uses the Open Kernel License Framework (OKLF) v1.3 (see OKLF-v1.3.md
for the full legal text). Key elements:
Base license: GPLv2-only with additional permissions (Sections 2-5 of OKLF). All kernel code — isle-core, isle-kernel, isle-compat, isle-net, isle-vfs, isle-block, isle-kvm, tools, and boot code — is GPLv2. This ensures:
- All kernel modifications must be open-sourced
- Proprietary forks are impossible
- Same legal framework the Linux ecosystem understands
Approved Linking License Registry (ALLR): A curated, append-only list of open-source licenses approved for use with kernel code. Tiers 1-2 may link with kernel code directly (Tier 0/1 drivers). Tier 3 licenses are GPL-incompatible and may NOT link with the kernel; Tier 3 code runs exclusively as Tier 2 process-isolated drivers communicating via KABI IPC, where no linking occurs:
- Tier 1 (weak copyleft, GPL-compatible): MPL-2.0, LGPL-2.1, EPL-2.0 (with Secondary License designation; see note below)
- Tier 2 (permissive): MIT, BSD-2, BSD-3, Apache-2.0, ISC, Zlib
- Tier 3 (incompatible — process isolation required, no linking): CDDL-1.0, CDDL-1.1, LGPL-3.0, EUPL-1.2 (see note below)
LGPL-3.0 incompatibility with GPLv2-only: LGPL-3.0 is incompatible with GPLv2-only code per the FSF compatibility matrix. LGPL-3.0 is defined as GPLv3 plus additional permissions (LGPL-3.0 Section 1: "This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License"). Since GPLv3 is incompatible with GPLv2-only (see GPLv3 exclusion note below), LGPL-3.0 inherits that incompatibility. LGPL-3.0 code must NOT be linked into the ISLE kernel. LGPL-3.0 code communicates with the kernel via KABI IPC only (Tier 3, process isolation required). Note that LGPL-2.1 IS compatible with GPLv2 and remains in Tier 1.
EUPL-1.2 classification (Tier 3): EUPL-1.2 is a strong copyleft license that the FSF classifies as GPL-incompatible. While EUPL Article 5 provides a compatibility list (including GPLv2, GPLv3, LGPL, AGPL, MPL-2.0, EPL-1.0, CeCILL) that allows EUPL-licensed code to be relicensed under those licenses when combined with code under those licenses, the FSF's position is that EUPL-1.2's copyleft is "comparable to the GPL's, and incompatible with it" by itself. ISLE places EUPL-1.2 in Tier 3 (process isolation required, no linking with kernel code) as the conservative default. EUPL-1.2 code that has been explicitly relicensed to GPLv2 via Article 5 by its copyright holder may then be treated as GPLv2 code and used in Tier 0/1. Without explicit relicensing, EUPL-1.2 code runs as a Tier 2 process-isolated driver communicating via KABI IPC only.
EPL-2.0 GPL compatibility: EPL-2.0 is GPL-compatible only when the distributor explicitly designates GPL as a Secondary License per EPL-2.0 §3.2. Without this designation, EPL-2.0 is GPL-incompatible. ISLE requires EPL-2.0 dependencies to carry the Secondary License designation; undesignated EPL-2.0 code is treated as Tier 3 (process isolation required, no linking with kernel code). ALLR Tier 1 inclusion applies only to EPL-2.0 code that explicitly carries the Secondary License designation for GPLv2. Enforcement: the KABI module loader checks for the Secondary License designation in the module's license metadata at load time. EPL-2.0 modules without the designation are rejected for Tier 0/1 loading and must run as Tier 2 process-isolated drivers. Additionally, EPL-2.0's patent grant (§2.2) requires contributors to grant a patent license for their contributions; ISLE cannot enforce this at a technical level, so EPL-2.0 code in Tier 1 carries an implicit assumption that upstream contributors have complied with §2.2. Code review should verify the Secondary License designation is present in the upstream project's license header, not just claimed in module metadata.
GPLv3 exclusion from ALLR: GPLv3 is deliberately excluded from the ALLR. ISLE's kernel is licensed GPLv2-only (not "GPLv2 or later"). GPLv3 is incompatible with GPLv2-only code per the FSF: GPLv3's additional requirements (anti-tivoization in Section 6, patent retaliation in Section 11) constitute "further restrictions" that GPLv2 Section 7 prohibits. Code licensed GPLv3-only cannot be linked into a GPLv2-only kernel. Code licensed "GPLv2 or later" CAN be used (under its GPLv2 grant), but code licensed GPLv3-only cannot. Adding GPLv3 to the ALLR would create a false impression that GPLv3-only code may be linked with the kernel. If GPLv3-only code must be used, it must run as a Tier 2 process-isolated driver (same as CDDL), communicating via KABI IPC with no linking.
CDDL and GPL incompatibility: CDDL is GPL-incompatible per the FSF. CDDL-licensed code may only run as Tier 2 drivers (separate address space, separate process). Despite CDDL appearing in the ALLR, no linking occurs between CDDL code and GPL kernel code. CDDL drivers communicate exclusively via KABI IPC (message passing across the process boundary), which provides the same license isolation as Linux's use of FUSE for ZFS-FUSE. In-kernel (Tier 0/1) CDDL code is NOT permitted. The KABI process boundary ensures the CDDL code and GPL code never form a single "work" in the copyright sense.
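The load-time gate implied by the notes above can be sketched as follows. The SPDX-style strings and the `classify` function are illustrative assumptions, not the actual KABI manifest schema:

```rust
/// Illustrative load decision for a driver module (not the real loader API).
#[derive(Debug, PartialEq)]
enum LoadDecision {
    LinkTier01,      // GPLv2 or ALLR Tier 1/2: may link with kernel code
    ProcessIsolated, // ALLR Tier 3: Tier 2 driver only, KABI IPC, no linking
    Reject,          // not on the ALLR: refused by default
}

fn classify(license: &str, epl_secondary_gplv2: bool) -> LoadDecision {
    match license {
        // Kernel's own license.
        "GPL-2.0-only" | "GPL-2.0-or-later" => LoadDecision::LinkTier01,
        // ALLR Tier 1 (weak copyleft) and Tier 2 (permissive): may link.
        "MPL-2.0" | "LGPL-2.1-only" | "MIT" | "BSD-2-Clause" | "BSD-3-Clause"
        | "Apache-2.0" | "ISC" | "Zlib" => LoadDecision::LinkTier01,
        // EPL-2.0 links only with an explicit GPLv2 Secondary License designation.
        "EPL-2.0" if epl_secondary_gplv2 => LoadDecision::LinkTier01,
        "EPL-2.0" => LoadDecision::ProcessIsolated,
        // GPL-incompatible Tier 3 (and GPLv3-only): process isolation required.
        "CDDL-1.0" | "CDDL-1.1" | "LGPL-3.0-only" | "EUPL-1.2" | "GPL-3.0-only" => {
            LoadDecision::ProcessIsolated
        }
        // Unknown or proprietary kernel-space code is rejected by default.
        _ => LoadDecision::Reject,
    }
}
```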
New licenses added via governance process (60-day review, supermajority LGB vote). Licenses are never removed (append-only for legal certainty).
Proprietary kernel-space code explicitly prohibited (OKLF Section 4.2(c)): Any code that loads into kernel address space and accesses internal kernel symbols is a derivative work and must comply with GPLv2 or an ALLR-listed license. This removes Linux's 30-year "gray area" about proprietary kernel modules.
Proprietary user-space drivers explicitly permitted (OKLF Section 4.2(b)): Code interacting with the kernel exclusively through the stable userspace interface (syscalls, /proc, /sys, VFIO, UIO, FUSE, eBPF) is not a derivative work. This maps directly to our Tier 2 driver model — hardware vendors who cannot open-source their drivers may use user-space driver frameworks with full isolation.
Anti-tivoization stance (OKLF Section 5): OKLF encourages but does not mandate installation information disclosure. The OKLF adds only additional permissions to GPLv2 (permitted by GPLv2 Sections 0 and 10), never additional restrictions. Anti-tivoization protection is achieved indirectly: the KABI stability guarantee means users can always replace a Tier 1/2 driver binary without modifying the kernel, making hardware lockdown of individual drivers less effective.
Firmware exception (OKLF Section 4.3): Binary firmware that runs on separate
processors (GPU microcode, Wi-Fi firmware, SSD firmware) is outside the license scope.
Distributed separately in firmware/. Code running on the main CPU is NOT firmware.
Legal risk acknowledgment — OKLF is a novel license framework built on GPLv2. While it is designed to be GPLv2-compatible (the "additional permissions" model is explicitly contemplated by GPLv2 §0 and §10), it has not been tested in court and constitutes a novel legal approach that should not be relied upon without independent legal review. Key risks: (1) the ALLR mechanism may be viewed by some lawyers as an untested extension of the "linking exception" concept — FSF/SFLC review is recommended before v1.0 final; (2) the OKLF provides weaker anti-tivoization protection than GPLv3, which is an accepted tradeoff for GPLv2 compatibility — OKLF cannot mandate installation information disclosure without violating GPLv2's "no further restrictions" clause; (3) ecosystem adoption depends on corporate legal teams accepting OKLF as GPLv2-compatible — even if legally sound, unfamiliarity may slow adoption; (4) the "additional permissions" model under GPLv2 §0/§10 is well-established in principle (e.g., GCC Runtime Library Exception, Qt commercial exception), but OKLF's scope (ALLR registry, driver tier classification, firmware exception) goes beyond typical additional permissions — a court could find that some OKLF provisions constitute "further restrictions" rather than "additional permissions," which GPLv2 §7 prohibits. This risk is mitigated by careful drafting but cannot be eliminated without judicial precedent. ISLE should seek early legal review from SFLC or equivalent, and provide a "plain GPLv2" fallback for organizations that cannot accept OKLF's additional terms.
KABI Driver SDK: The isle-driver-sdk crate (ABI type definitions, vtable layouts, ring buffer protocol, DMA types) is dual-licensed Apache-2.0 OR MIT. This is the interface contract — drivers of any ALLR-listed license can link against these types without friction.
How this maps to our driver tiers:
| Tier | Location | License requirement | OKLF section |
|---|---|---|---|
| Tier 0 (boot-critical) | In-kernel, static | GPLv2 or ALLR | 4.1 (in-tree) |
| Tier 1 (domain-isolated) | Ring 0, loaded | GPLv2 or ALLR | 4.2 (out-of-tree open-source) |
| Tier 2 (user-space) | Ring 3, process | Any (incl. proprietary) | 4.2(b) (userspace interface) |
Three ABI stability tiers (extending OKLF Section 3):
| Interface | Stable? | Policy |
|---|---|---|
| Internal kernel APIs | No | May change between any two releases |
| KABI (driver ABI) | Yes | Versioned, append-only, binary-stable |
| Userspace ABI (syscalls) | Yes | Never broken without extended deprecation |
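The append-only KABI policy in the middle row can be checked mechanically. A simplified sketch of the comparison a tool like kabi-compat-check could perform (real .kabi files carry full type layouts, not just names and signature strings):

```rust
/// Minimal stand-in for one vtable slot in a .kabi interface definition.
#[derive(Clone, PartialEq, Debug)]
struct VtableEntry {
    name: &'static str,
    signature: &'static str,
}

/// A new interface version is binary-compatible iff every old entry
/// appears unchanged, in order, as a prefix of the new vtable:
/// entries may only be appended, never reordered, removed, or retyped.
fn is_append_only(old: &[VtableEntry], new: &[VtableEntry]) -> bool {
    new.len() >= old.len() && &new[..old.len()] == old
}
```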
The following table summarizes common licensing concerns and how the OKLF addresses them:
| Concern | How addressed |
|---|---|
| Prevent proprietary kernel forks | GPLv2 copyleft |
| Allow ZFS (CDDL) | CDDL in ALLR Tier 3 — ZFS runs as a Tier 2 process-isolated driver (KABI IPC provides license boundary, no linking occurs) |
| Allow Nvidia GPU (proprietary) | Tier 2 user-space driver via VFIO |
| Allow BSD/MIT drivers | BSD/MIT in ALLR — full kernel-space access |
| Force kernel improvements to be open | GPLv2 copyleft on all kernel crates |
| Module enforcement | Kernel refuses non-compliant modules by default |
| Clear legal boundaries | OKLF explicit text, not legal gray area |
Appendix B. Project Structure
Note: This appendix describes the target project structure at full implementation. The current codebase (see CLAUDE.md "Project Structure") contains the foundational crates (isle-kernel, isle-core, isle-driver-sdk, isle-compat, isle-net, isle-vfs, isle-block, isle-kvm). Additional crates listed below (e.g., isle-accel, isle-cluster, drivers/) will be added as their corresponding architecture sections are implemented.
isle-kernel/
Cargo.toml # Workspace root (all crates)
ARCHITECTURE.md # This document
isle-core/ # Microkernel core
Cargo.toml
src/
main.rs # Boot entry point (calls arch-specific init)
cap/ # Capability system
mod.rs # Capability types, tables, operations
revocation.rs # Generation-based revocation
mem/ # Memory management
phys.rs # Physical page allocator (buddy)
vmm.rs # Virtual memory manager (maple tree, VMAs)
page_cache.rs # Page cache (RCU radix tree)
slab.rs # Slab allocator for kernel objects
pcid.rs # PCID/ASID management
huge.rs # Huge page (THP + explicit) support
sched/ # Scheduler
mod.rs # Scheduler core, class dispatch
cfs.rs # CFS/EEVDF fair scheduler
rt.rs # RT FIFO/RR scheduler
deadline.rs # Deadline (EDF/CBS) scheduler
balance.rs # NUMA-aware load balancer
ipc/ # IPC and isolation
mpk.rs # MPK domain management, WRPKRU helpers
ring.rs # Shared-memory ring buffers
tier2_ipc.rs # Cross-address-space IPC for Tier 2
arch/ # Architecture-specific Rust code
mod.rs # Architecture trait definitions
x86_64/ # x86-64 implementation
mod.rs
gdt.rs # GDT setup
idt.rs # IDT and interrupt dispatch
apic.rs # Local APIC driver (Tier 0)
timer.rs # HPET/TSC/APIC timer (Tier 0)
mpk.rs # MPK hardware interface
vmx.rs # VMX support for KVM
aarch64/ # ARM64 implementation (phase 2+)
mod.rs
armv7/ # ARMv7 implementation (phase 2+)
mod.rs
riscv64/ # RISC-V 64 implementation (phase 2+)
mod.rs
ppc32/ # PPC32 implementation (phase 2+)
mod.rs
ppc64le/ # PPC64LE implementation (phase 2+)
mod.rs
isle-compat/ # Linux syscall interface + compat shims
Cargo.toml
src/
syscall/ # ~450 syscall dispatch table
mod.rs # SyscallHandler enum, dispatch table
process.rs # fork, clone, execve, exit, wait
file.rs # open, read, write, close, ioctl
memory.rs # mmap, brk, mprotect, madvise
network.rs # socket, bind, listen, accept, connect
time.rs # clock_gettime, nanosleep, timer_*
misc.rs # getpid, getuid, uname, sysinfo
proc/ # /proc filesystem emulation
mod.rs
meminfo.rs # /proc/meminfo
cpuinfo.rs # /proc/cpuinfo
pid.rs # /proc/[pid]/* (maps, status, fd, etc.)
sys.rs # /proc/sys/* (sysctl interface)
sys/ # /sys filesystem emulation
mod.rs
devices.rs # /sys/devices/ device tree
class.rs # /sys/class/ device classes
bus.rs # /sys/bus/ bus enumeration
dev/ # /dev filesystem emulation
mod.rs
devtmpfs.rs # devtmpfs-compatible device nodes
signal/ # Signal handling
mod.rs
delivery.rs # Signal delivery to user space
handlers.rs # Default handlers, core dump
namespace/ # Linux namespace implementation
mod.rs
mnt.rs # Mount namespace
pid.rs # PID namespace
net.rs # Network namespace
user.rs # User namespace
ipc.rs # IPC namespace
uts.rs # UTS namespace
cgroup.rs # Cgroup namespace
time.rs # Time namespace
cgroup/ # Cgroup v1/v2
mod.rs
v2.rs # Unified hierarchy (primary)
v1_compat.rs # Legacy hierarchy (compatibility)
controllers/ # cpu, memory, io, pids, etc.
io_uring/ # io_uring subsystem
mod.rs
ring.rs # SQ/CQ ring management
sqpoll.rs # SQPOLL kernel thread
ops.rs # Operation dispatch
lsm/ # Linux Security Modules
mod.rs
hooks.rs # Hook framework
selinux.rs # SELinux policy engine
apparmor.rs # AppArmor profile engine
seccomp.rs # seccomp-bpf filter
ebpf/ # eBPF subsystem
mod.rs
vm.rs # eBPF virtual machine
verifier.rs # Static verifier
jit/ # JIT compilers
x86_64.rs
aarch64.rs
armv7.rs
riscv64.rs
ppc32.rs
ppc64le.rs
maps.rs # Map types (hash, array, ringbuf, etc.)
helpers.rs # eBPF helper functions
programs.rs # Program types (XDP, tc, kprobe, etc.)
isle-net/ # Network stack (runs as Tier 1)
Cargo.toml
src/
tcp/ # TCP/IP implementation
udp/ # UDP implementation
ip/ # IP layer (v4 + v6)
arp.rs # ARP
icmp.rs # ICMP
netfilter/ # nftables + iptables compatibility
mod.rs
nft.rs # nftables engine
conntrack.rs # Connection tracking
nat.rs # NAT (SNAT, DNAT, masquerade)
xdp/ # XDP fast path
socket.rs # Socket abstraction
tunnel/ # Tunnel protocol modules (§35)
mod.rs # TunnelDevice trait
vxlan.rs # VXLAN encap/decap
geneve.rs # Geneve encap/decap
gre.rs # GRE/GRE6
ipip.rs # IPIP/SIT
wireguard.rs # WireGuard VPN
bridge/ # Software L2 switch (§35)
mod.rs # Bridge device, FDB, STP
vlan.rs # 802.1Q VLAN filtering
veth.rs # Virtual ethernet pairs
macvlan.rs # macvlan/ipvlan devices
vrf.rs # Virtual Routing and Forwarding
isle-vfs/ # Virtual filesystem layer (Tier 1)
Cargo.toml
src/
mod.rs # VFS dispatch, mount table
ext4/ # ext4 filesystem
xfs/ # XFS filesystem
btrfs/ # btrfs filesystem
tmpfs/ # tmpfs (in-memory)
overlayfs/ # OverlayFS (for containers)
dcache.rs # Directory entry cache
isle-block/ # Block I/O layer (Tier 1)
Cargo.toml
src/
mod.rs # Block device abstraction
scheduler.rs # I/O schedulers (mq-deadline, none, bfq)
partition.rs # Partition table parsing (GPT, MBR)
dm/ # Device-mapper framework (§29)
mod.rs # DM core: target dispatch, table management
linear.rs # dm-linear
striped.rs # dm-striped
mirror.rs # dm-mirror
crypt.rs # dm-crypt (AES-XTS)
verity.rs # dm-verity
snapshot.rs # dm-snapshot (COW)
thin.rs # dm-thin-pool
md.rs # MD RAID (0/1/5/6/10) superblock compat
lvm.rs # LVM2 metadata reader
recovery.rs # Recovery-aware volume state machine
iscsi/ # iSCSI block storage (§30)
mod.rs # iSCSI common: PDU parsing, session state
initiator.rs # iSCSI initiator (RFC 7143)
target.rs # iSCSI target (LIO-compatible config)
iser.rs # iSER — RDMA transport for iSCSI
chap.rs # CHAP authentication
multipath.rs # dm-multipath integration
nvmeof/ # NVMe over Fabrics (§30)
mod.rs # NVMe-oF common: capsule parsing, queue pairs
host.rs # NVMe-oF initiator (host) — connect, I/O
target.rs # NVMe-oF target (subsystem) — nvmetcli compat
tcp.rs # NVMe/TCP transport (TP 8000)
rdma.rs # NVMe/RDMA transport (TP 8001)
discovery.rs # Discovery controller client/server
ana.rs # ANA multipath — asymmetric namespace access
isle-kvm/ # KVM hypervisor (Tier 1)
Cargo.toml
src/
mod.rs # /dev/kvm interface
vmx.rs # Intel VMX
svm.rs # AMD SVM
mmu.rs # Nested page tables (EPT/NPT)
tee/ # Confidential VM support (§26)
sev.rs # AMD SEV-SNP guest/host
tdx.rs # Intel TDX guest/host
cca.rs # ARM CCA realm management
isle-accel/ # AI/ML accelerator subsystem (§42)
Cargo.toml
src/
mod.rs # AccelBase trait, device registration
scheduler.rs # CBS-based accelerator scheduler
hmm.rs # Heterogeneous memory management
p2p.rs # Peer-to-peer DMA (PCIe, NVLink, CXL)
inference.rs # In-kernel inference engine
rdma.rs # RDMA and collective ops
isle-cluster/ # Distributed kernel (§47)
Cargo.toml
src/
mod.rs # Cluster topology, node discovery
transport.rs # KernelTransport (RDMA, CXL, TCP)
ipc.rs # Distributed IPC proxy
dsm.rs # Distributed shared memory
dlm.rs # Distributed Lock Manager (§31a)
global_pool.rs # Global memory pool
scheduler.rs # Cluster-wide scheduling
caps.rs # Network-portable capabilities
isle-driver-sdk/ # Stable driver SDK
Cargo.toml
interfaces/ # .kabi IDL definitions
block_device.kabi # Block device interface
net_device.kabi # Network device interface
gpu_device.kabi # GPU device interface
input_device.kabi # Input device interface
usb_device.kabi # USB device interface
char_device.kabi # Character device interface
pci_device.kabi # PCI device interface
platform_device.kabi # Platform device interface
src/
lib.rs # SDK entry point, driver registration
abi.rs # Generated stable ABI types
dma.rs # DMA buffer management
mmio.rs # MMIO access helpers (volatile read/write)
irq.rs # Interrupt handling
ring.rs # Ring buffer helpers for driver use
manifest.rs # Driver manifest parsing
drivers/ # In-tree drivers
tier0/ # Boot-critical (statically linked)
apic/ # Local APIC + I/O APIC
timer/ # PIT / HPET / TSC
serial/ # Early serial console
vga/ # Early VGA text console
tier1/ # Performance-critical (domain-isolated)
nvme/ # NVMe SSD driver
virtio_blk/ # VirtIO block device
virtio_net/ # VirtIO network device
virtio_gpu/ # VirtIO GPU
virtio_console/ # VirtIO console
e1000/ # Intel e1000 NIC
igb/ # Intel igb NIC
ahci/ # AHCI/SATA controller
ext4/ # ext4 driver component
tier2/ # Isolated (user-space process)
usb_xhci/ # USB XHCI host controller
usb_hid/ # USB HID (keyboard, mouse)
usb_storage/ # USB mass storage
hda_audio/ # Intel HDA audio
input/ # Input subsystem (evdev)
tools/
kabi-compiler/ # .kabi IDL -> Rust/C code generator
Cargo.toml
src/
main.rs
parser.rs # IDL parser
codegen_rust.rs # Rust binding generator
codegen_c.rs # C binding generator
kabi-compat-check/ # ABI compatibility CI checker
Cargo.toml
src/
main.rs # Diffs old vs new .kabi, rejects breaks
isle-initramfs/ # Initramfs builder tool
Cargo.toml
src/
main.rs # Packs drivers + early userspace
arch/ # Architecture-specific C/asm
x86_64/
boot/ # UEFI/BIOS boot stub (C + asm)
header.S # Linux boot protocol header
main.c # Early C boot code
efi_stub.c # UEFI stub
asm/
entry.S # Syscall entry/exit
switch.S # Context switch
irq_stubs.S # Interrupt stub table
vdso/
vdso.lds # vDSO linker script
clock_gettime.c # clock_gettime implementation
getcpu.c # getcpu implementation
aarch64/
boot/ # ARM64 boot stub
asm/ # ARM64 assembly
vdso/ # ARM64 vDSO
riscv64/
boot/ # RISC-V boot stub
asm/ # RISC-V assembly
vdso/ # RISC-V vDSO
ppc32/
boot/ # PPC32 boot stub
asm/ # PPC32 assembly
vdso/ # PPC32 vDSO
ppc64le/
boot/ # PPC64LE boot stub
asm/ # PPC64LE assembly
vdso/ # PPC64LE vDSO
tests/
abi_compat/ # Old driver binaries for compat regression
syscall/ # Linux syscall conformance (LTP-based)
driver/ # Driver integration tests
bench/ # Performance regression benchmarks
crash_recovery/ # Fault injection + recovery verification
Appendix C. What ISLE Provides That Linux Cannot
| Feature | Linux | ISLE |
|---|---|---|
| Driver crash recovery | Kernel oops or panic depending on fault type. Many driver bugs produce oops (system continues with degraded functionality) rather than panic. Recovery requires at minimum driver module reload; severe faults cause panic and full reboot (30-60s). | Reload driver in ~50-150ms (Tier 1) or ~10ms (Tier 2) |
| Stable driver ABI | None (recompile every update) | Versioned, append-only, binary-stable KABI |
| Driver isolation | None (shared address space) | Domain isolation + IOMMU (Tier 1), full process (Tier 2) |
| Capability-based security | Bolt-on (POSIX caps are coarse) | Foundational architecture |
| Lock ordering enforcement | Runtime lockdep (debug only) | Compile-time via Rust type system: type-level lock ordering using phantom type parameters that encode lock level in the type signature (e.g., Lock<Level3>), preventing out-of-order acquisition at compile time. See isle-core lock design (Section 16). |
| io_uring security | Bypasses syscall monitoring | Per-instance operation whitelist |
| Hot driver upgrade | Fragile (unstable ABI) | Clean stop/start with stable KABI |
| Memory safety | C everywhere | Rust with minimal unsafe at hardware boundaries |
| Many-core scalability | Known bottlenecks (RTNL for networking, inode_lock for VFS, cgroup_mutex for cgroups) | No global locks, per-CPU/per-NUMA everywhere |
| Proactive fault management | Ad-hoc (mcelog, rasdaemon) | Unified FMA with diagnosis engine (Section 39) |
| Memory compression | zswap/zram (separate, config-heavy) | Integrated NUMA-aware zpool tier (Section 13) |
| CPU bandwidth guarantee | No floor mechanism | CBS-backed cpu.guarantee (Section 15) |
| Stable observability ABI | Tracepoints are unstable | Versioned, documented stable tracepoints (Section 40) |
| Verified boot chain | Fragmented (UEFI SB + IMA + dm-verity) | Unified chain from firmware to drivers (Section 22) |
| Kernel object introspection | Per-subsystem (/proc, /sys, scattered) | Unified object namespace via islefs (Section 41) |
| Driver state preservation | Lost on crash — cold restart | Checkpointed state buffer, warm restart (Section 9) |
| Core panic diagnostics | kexec + kdump (complex setup) | In-place crash dump to reserved memory (Section 9) |
| Context switch XSAVE cost | Eager XSAVE with XSAVEOPT/XSAVES optimizations (skips unmodified components, but still saves full state for context switches involving SIMD). ISLE's lazy approach avoids save/restore entirely for non-SIMD threads. | Lazy XSAVE — zero cost for non-SIMD threads (Section 14.6) |
| CPU errata management | Scattered #ifdef, ad-hoc | Structured quirk table + boot-param controls (Section 17.4) |
| Volume layer + driver crash | Device marked failed, RAID resync | Recovery-aware: pause I/O, resume clean (Section 29) |
| VM guest driver crash | VM reboot required | Driver recovers in-place, hypervisor unaware (Section 37) |
| Block storage networking | Separate stacks (open-iscsi, nvme-cli, no unified recovery) | Unified iSCSI + NVMe-oF with RDMA upgrade and crash recovery (Section 30) |
| Clustered FS + driver crash | Node fenced, ejected from cluster | Driver recovers in-place, node stays in cluster (Section 31) |
| Distributed locking | TCP-based DLM (~10-100 μs/op depending on lock locality; local locks <1 μs), global recovery quiesce on any node failure | RDMA-native DLM (~2-3 μs uncontested, ~5-10 μs contested), per-resource recovery, lease-based extension, batch ops (Section 31a) |
| TPM key management | Userspace daemon (tpm2-abrmd) | Kernel-native resource manager + capability integration (Section 23) |
| Runtime integrity | IMA bolted onto VFS, optional | Integrated with capability system and driver loading (Section 24) |
| Display stack crash | X/Wayland session lost | DMA-BUF survives driver reload, compositor stalls ~50-150ms (Section 46.2.6) |
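The compile-time lock ordering row above can be illustrated with a minimal sketch. Type and method names here are hypothetical; the actual design is specified in Section 16:

```rust
use std::marker::PhantomData;

// Lock levels as zero-sized marker types. A lock at level N may only
// be acquired while holding locks of strictly lower levels.
struct L1;
struct L2;
struct L3;

trait Below<Other> {}
impl Below<L2> for L1 {}
impl Below<L3> for L1 {}
impl Below<L3> for L2 {}

struct Lock<Level> {
    name: &'static str,
    _level: PhantomData<Level>,
}

/// Proof-of-acquisition token; holding one is required for nested locking.
struct Held<'a, Level> {
    lock: &'a Lock<Level>,
}

impl<Level> Lock<Level> {
    const fn new(name: &'static str) -> Self {
        Lock { name, _level: PhantomData }
    }

    // First acquisition: no locks held, always allowed.
    fn acquire(&self) -> Held<'_, Level> {
        Held { lock: self }
    }
}

impl<'a, Level> Held<'a, Level> {
    // Nested acquisition compiles only when the held level is Below the
    // requested one; out-of-order nesting is a type error, not a runtime bug.
    fn acquire_nested<'b, Next>(&self, next: &'b Lock<Next>) -> Held<'b, Next>
    where
        Level: Below<Next>,
    {
        Held { lock: next }
    }
}
```

Acquiring an L2 lock while holding an L1 lock compiles; the reverse order fails to compile because no `Below<L1> for L2` impl exists.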
Appendix D. Cross-Feature Integration Map
D.1 Cross-Feature Integration Map
These features are not independent — they reinforce each other:
Formal verification (§50) ──────► Confidential computing (§26)
Proves capability system correct Relies on correct capability enforcement
Safe extensibility (§51) ◄──────► Live evolution (§52)
Policy modules are hot-swappable Evolution uses the same mechanism
Intent-based management (§53) ◄──► In-kernel inference (Section 45)
Intent optimizer uses learned models Models optimize for declared intents
EAS / heterogeneous CPU (Section 14.5) ◄──► Power budgeting (§49)
EAS picks energy-optimal core Power budget enforces watt cap
Power budgeting (§49) ◄──────► Intent-based management (§53)
Power budget is a constraint Intents include efficiency preference
Hardware memory safety (§19) ──────► Tier 1 driver isolation (Section 6)
MTE catches C driver bugs Domain isolation catches the resulting faults
Confidential computing (§26) ──────► Distributed kernel (Section 47)
TEE-to-TEE RDMA DSM coherence for encrypted pages
Post-quantum crypto (§25) ──────► Distributed capabilities (Section 47.9)
PQC signatures on capabilities Network-portable across cluster
SmartNIC/DPU (§48) ◄──────► Distributed kernel (Section 47)
DPU = close remote node Same proxy driver pattern
Persistent memory (§32) ◄──────► Memory tiers (Section 43)
Persistent memory = another tier Managed by same PageLocationTracker
Computational storage (§33) ◄──► Accelerator framework (Section 42)
CSD = storage accelerator Same AccelBase vtable
Unified compute (§54) ◄──────► EAS / heterogeneous CPU (Section 14.5)
Multi-dim capacity extends scalar CPU capacity is a special case
Unified compute (§54) ◄──────► Accelerator scheduler (Section 42.2.4)
Cross-device topology + energy data Accel scheduler consumes advisory
Unified compute (§54) ◄──────► Power budgeting (§49)
Workload profile drives throttle Informed cross-device power decisions
Unified compute (§54) ◄──────► Intent-based management (§53)
compute.weight feeds intent optimizer Optimizer adjusts per-domain knobs
Unified compute (§54) ◄──────► Distributed kernel (Section 47)
Peer kernel nodes via NodeTransport Accelerator = close compute node
Unified compute (§54) ◄──────► SmartNIC/DPU offload (§48)
Same convergence: device → peer node NodeTransport unifies both transports
Distributed Lock Manager (§31a) ◄──► RDMA transport (Section 47.3)
DLM uses RDMA CAS/Send for locks Transport provides kernel RDMA API
Distributed Lock Manager (§31a) ◄──► Cluster membership (Section 47.11)
DLM receives join/leave/dead events Single heartbeat source for both
Distributed Lock Manager (§31a) ◄──► Clustered filesystems (Section 31)
GFS2/OCFS2 use DLM for coordination DLM lock modes map to FS operations
Distributed Lock Manager (§31a) ◄──► Driver recovery (Section 9)
DLM in isle-core survives driver crash No lock recovery needed on Tier 1 reload
Bootstrap Circular Dependency:
The intent optimizer (Section 53) uses in-kernel inference models (Section 45), but those models may not be loaded at early boot. Resolution: the intent optimizer degrades gracefully to static defaults when models are unavailable. At boot:
1. Intent optimizer starts with hardcoded heuristics (e.g., "latency target → raise cpu.weight by 20%").
2. When the inference engine loads models (typically within seconds of boot), the optimizer transitions to learned optimization.
3. The transition is seamless — no reconfiguration needed.
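A minimal sketch of this fallback, assuming a hypothetical optimizer type (not the ISLE API). The transition is a plain state change, which is why no reconfiguration is needed:

```rust
/// Which decision backend is active (illustrative).
enum Backend {
    /// Hardcoded heuristics used until models are available.
    Static,
    /// Learned models loaded by the inference engine.
    Learned { model_version: u32 },
}

struct IntentOptimizer {
    backend: Backend,
}

impl IntentOptimizer {
    fn new() -> Self {
        IntentOptimizer { backend: Backend::Static }
    }

    /// Called by the inference engine once models are loaded.
    fn models_loaded(&mut self, version: u32) {
        self.backend = Backend::Learned { model_version: version };
    }

    /// Example knob: cpu.weight boost (percent) for a latency intent.
    fn cpu_weight_boost_pct(&self, latency_intent: bool) -> u32 {
        match &self.backend {
            // The hardcoded heuristic from the text: latency -> +20%.
            Backend::Static => if latency_intent { 20 } else { 0 },
            // Placeholder: a real implementation would run model inference here.
            Backend::Learned { .. } => if latency_intent { 20 } else { 0 },
        }
    }
}
```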
D.2 Implementation Dependency Graph
Foundation (no dependencies):
├── Formal verification readiness (§50) — design methodology
├── Post-quantum crypto abstraction (§25) — data structure sizing
└── Real-time preemption model (§16) — lock design
Early integration:
├── Hardware memory safety (§19) — needs memory allocator
├── Power budgeting (§49) — needs scheduler
└── Safe extensibility (§51) — needs KABI vtable mechanism
Mid integration:
├── Confidential computing (§26) — needs memory manager, IOMMU
├── Intent-based management (§53) — needs inference engine, cgroups
└── Live evolution (§52) — needs extensibility mechanism
Late integration:
├── SmartNIC/DPU offload (§48) — needs proxy driver, device registry
├── Persistent memory (§32) — needs VFS, memory tiers
├── Computational storage (§33) — needs AccelBase framework
├── Unified compute topology (§54) — needs AccelBase, EAS (Section 14.5), power budgeting (§49)
└── Peer kernel nodes (§54.13) — needs unified compute + distributed kernel (Section 47)
Appendix E. Open Questions
The following cross-cutting items require further design work. Each is tracked as an open question with the affected sections and the specific decision to be made.
io_uring integration (affects §20.5, §26, §52, §48):
- Registered buffers in confidential computing: io_uring pre-registers DMA buffers
at setup time. When a VM runs under SEV-SNP, these buffers must be in shared
(unencrypted) memory. Decision needed: register-time enforcement vs. lazy conversion.
- State migration during live evolution: io_uring's SQ/CQ rings, registered files, and
registered buffers constitute persistent state. The live evolution framework (§52) needs
a StateSerializer for io_uring context. Decision needed: drain-and-recreate vs.
in-place serialization.
- DPU submission offload: DPUs can process io_uring submission queues directly, bypassing
host CPU for network and storage operations. Decision needed: how the DPU reads SQ
entries (shared memory mapping vs. DMA push) and how completions are posted back to CQ.
GPU virtualization (affects §42, §26):
- Confidential GPU VMs require that GPU VRAM is encrypted and attestable. SEV-SNP does not natively protect PCIe device memory. TDX Connect (Intel) and ARM CCA device assignment are emerging but not yet stable. Decision needed: software bounce buffer path (safe, slow) vs. hardware-assisted device encryption (fast, hardware-dependent).
- Nested virtualization with GPU passthrough: a confidential VM running a nested hypervisor that passes through a GPU adds three layers of IOMMU translation. Decision needed: whether to support this (performance may be prohibitive).
Testing strategy for cross-feature interactions (affects all §26-§56):
- Combinatorial explosion: 15 features yield 105 pairwise interactions. Exhaustive testing is infeasible. Prioritized critical pairs:
  1. RT + confidential computing (latency impact of memory encryption)
  2. Power budgeting + intent optimization (conflicting objectives)
  3. MTE + DSM page migration (tag preservation across RDMA transfer)
  4. Live evolution + RT (component swap during hard-RT operation)
  5. DPU offload + confidential computing (encrypted DPU-host channel)
- Decision needed: test matrix, CI infrastructure, acceptance thresholds for each pair.
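The 105 figure is simply the unordered-pair count n(n-1)/2 for n = 15 features:

```rust
// Number of distinct feature pairs among n features (binomial coefficient C(n, 2)).
fn pairwise_interactions(n: u32) -> u32 {
    n * (n - 1) / 2
}
```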
Secure boot measurement chain (affects §22, §26, §51, §52):
- Live kernel evolution (§52) replaces kernel components at runtime. Each new component must be measured into a TPM PCR (or TDX MRTD) before activation to maintain the attestation chain. Decision needed: which PCR to extend (dedicated PCR for dynamic components vs. shared with static measurement), and how remote attestors distinguish "legitimate evolution" from "compromised kernel."
- Policy modules (§51) loaded at runtime must also be measured. Decision needed: whether measurement is mandatory (blocks unsigned modules) or advisory (measure but allow).
CXL 3.0 fabric management (affects §47, §32, §54):
- CXL 3.0 introduces fabric-attached memory with hardware-managed coherence. Decision needed: how this integrates with the distributed kernel's software DSM protocol (§47.5). Options: CXL replaces DSM for intra-rack, DSM remains for inter-rack; or DSM degrades gracefully when CXL is available.
Multi-architecture parity for advanced features (affects §18, §26-§56):
- Many features in §26-§56 are specified in terms of architecture-specific mechanisms (WRPKRU, SEV-SNP, and RAPL on x86-64; MTE on ARM64). Equivalents exist on the other architectures for some but not all of these. Partially addressed: §18 now includes an "Advanced Feature Architecture Parity" matrix covering 8 key features across all six architectures. Remaining decision: per-feature acceptance criteria for "software fallback" vs. "not supported" (performance thresholds, testing requirements).
eBPF verifier completeness (affects §20.4):
- The Linux eBPF verifier has had multiple privilege-escalation CVEs (CVE-2021-31440, CVE-2021-3490, CVE-2022-23222) stemming from incorrect bounds tracking, register-state pruning, and speculative-execution handling. ISLE's clean-room Rust reimplementation must achieve equivalent safety guarantees without inheriting these bug classes. Decision needed: the formal verification target (the full verifier including all helper-function type checking, or only the bounds-checking core and control-flow analysis), and which subset of eBPF features to support in Phase 2 (conservative: XDP and socket filters only) vs. Phase 3 (full Linux parity including tracing, cgroup, and struct_ops programs).
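To make the "bounds-checking core" concrete: the heart of the range analysis is interval arithmetic over register values, where any operation that might overflow must widen conservatively rather than wrap. This is a minimal sketch of that idea, not ISLE's actual verifier design; the `Bounds` type and its methods are illustrative.

```rust
/// Each register carries a conservative [min, max] interval. Arithmetic
/// that could overflow must collapse to "unknown" rather than wrap --
/// CVE-2021-3490-class bugs came from getting exactly this step wrong.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: u64,
    max: u64,
}

impl Bounds {
    const UNKNOWN: Bounds = Bounds { min: 0, max: u64::MAX };

    /// Interval addition; any possible overflow widens to UNKNOWN.
    fn add(self, other: Bounds) -> Bounds {
        match (self.min.checked_add(other.min), self.max.checked_add(other.max)) {
            (Some(min), Some(max)) => Bounds { min, max },
            _ => Bounds::UNKNOWN,
        }
    }

    /// A memory access is provably safe only if the *entire* interval
    /// stays inside [0, limit).
    fn fits_within(self, limit: u64) -> bool {
        self.max < limit
    }
}

fn main() {
    let idx = Bounds { min: 0, max: 15 };
    let off = Bounds { min: 0, max: 8 };
    assert!(idx.add(off).fits_within(24));        // worst case 23 < 24: safe
    assert!(!idx.add(off).fits_within(16));       // 23 escapes a 16-byte buffer
    assert_eq!(Bounds { min: 1, max: u64::MAX }.add(off), Bounds::UNKNOWN);
}
```

The Linux verifier additionally tracks known-bit patterns (tnums) and signed ranges alongside unsigned ones; a formal verification target scoped to "the bounds-checking core" would cover at least this interval layer and its interaction with branch pruning.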
io_uring + SEV-SNP shared buffer management (affects §20.5, §26):
- io_uring's registered buffer mechanism pre-pins buffers for zero-copy I/O. Under
SEV-SNP, these buffers must reside in the shared (C-bit clear) memory region for
device DMA access. However, shared memory is unencrypted and visible to the host VMM.
Decision needed: per-buffer encryption/decryption at registration time using AES-GCM
(performance cost: ~1 microsecond per 4 KiB page for encrypt + MAC, acceptable for
storage but potentially significant for high-IOPS NVMe workloads), or accept that
io_uring registered buffers are in plaintext. The plaintext approach is acceptable for
block storage (ciphertext is on disk anyway) but not for network payloads containing
secrets (TLS session keys, authentication tokens). A per-buffer policy flag
(IORING_REGISTER_BUFFERS_ENCRYPTED) could let applications choose.
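A sketch of that per-buffer policy is below. The flag name follows the proposal above but is hypothetical, and the XOR routine is a deliberately fake stand-in for the AES-GCM encrypt + MAC step (a real implementation would use the kernel's crypto services).

```rust
/// Hypothetical registration flag from the proposal above.
const IORING_REGISTER_BUFFERS_ENCRYPTED: u32 = 1 << 0;

struct RegisteredBuffer {
    /// Under SEV-SNP this copy lives in shared (C-bit clear) memory,
    /// visible to the host VMM and accessible to device DMA.
    shared_region: Vec<u8>,
    encrypted: bool,
}

/// Stand-in for AES-GCM encrypt + MAC (~1 microsecond per 4 KiB page).
/// NOT real crypto -- illustrative only.
fn aes_gcm_encrypt_in_place(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b ^= 0xAA;
    }
}

fn register_buffer(payload: &[u8], flags: u32) -> RegisteredBuffer {
    let mut shared_region = payload.to_vec(); // copy into the shared region
    let encrypted = flags & IORING_REGISTER_BUFFERS_ENCRYPTED != 0;
    if encrypted {
        // Network payloads (TLS session keys, tokens) must never sit in
        // plaintext in host-visible memory; block-storage buffers may opt out.
        aes_gcm_encrypt_in_place(&mut shared_region);
    }
    RegisteredBuffer { shared_region, encrypted }
}

fn main() {
    let secret = b"tls-session-key";
    let enc = register_buffer(secret, IORING_REGISTER_BUFFERS_ENCRYPTED);
    assert!(enc.encrypted);
    assert_ne!(&enc.shared_region[..], &secret[..]); // host never sees plaintext

    let block = register_buffer(b"disk sector", 0); // plaintext OK for storage
    assert!(!block.encrypted);
    assert_eq!(&block.shared_region[..], b"disk sector");
}
```

The design point is that the policy is chosen per registration, so a single application can register plaintext buffers for block I/O and encrypted buffers for network payloads in the same ring.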
GPU VRAM encryption for confidential VMs (affects §42, §26):
- NVIDIA H100 supports CC (Confidential Computing) mode with hardware-encrypted VRAM and attestable GPU firmware. AMD MI300X does not yet support VRAM encryption in SEV-SNP mode (GPU memory is outside the encrypted memory boundary). Decision needed: a software fallback path in which GPU computations operate on encrypted host memory via bounce buffers (10-100x slower due to PCIe round-trips and CPU-side encryption), or marking GPU passthrough as "not confidential" on AMD platforms until hardware support lands. A third option: restrict confidential GPU workloads to inference only (model weights are public, so only input/output needs encryption) and encrypt only the host-to-GPU and GPU-to-host transfer buffers.
CXL 3.0 coherence domain interaction with DSM (affects §47.5, §47.12):
- CXL 3.0 Type 3 devices provide hardware-coherent shared memory between hosts via the CXL.mem protocol with back-invalidate support. This capability overlaps with the software DSM protocol defined in Section 47.5. Decision needed: when a CXL 3.0 fabric is available between two nodes, should the DSM protocol defer entirely to CXL hardware coherence (simpler, lower latency at ~200 ns vs. ~5 microseconds for software DSM, but limited to CXL-connected nodes within a single rack), or should DSM provide a unified abstraction that uses CXL as a fast transport underneath (more complex, but a uniform API across CXL and non-CXL nodes)? The hybrid approach adds a transport-selection layer to DSM that routes coherence traffic over CXL when available and falls back to RDMA otherwise.
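The hybrid transport-selection layer can be reduced to a small routing decision. This sketch assumes each node advertises an optional CXL fabric ID; the types and the latency figures in the comments are illustrative, mirroring the numbers above.

```rust
/// Which transport carries DSM coherence traffic between a pair of nodes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CoherenceTransport {
    Cxl,  // hardware back-invalidate, ~200 ns, same-rack CXL fabric only
    Rdma, // software DSM protocol (Section 47.5), ~5 us, any node pair
}

struct Node {
    id: u32,
    /// Fabric ID if this node is attached to a CXL 3.0 fabric.
    cxl_fabric: Option<u32>,
}

/// Route over CXL only when both nodes sit on the *same* fabric;
/// otherwise fall back to the software DSM protocol over RDMA.
fn select_transport(a: &Node, b: &Node) -> CoherenceTransport {
    match (a.cxl_fabric, b.cxl_fabric) {
        (Some(fa), Some(fb)) if fa == fb => CoherenceTransport::Cxl,
        _ => CoherenceTransport::Rdma,
    }
}

fn main() {
    let rack_a1 = Node { id: 1, cxl_fabric: Some(7) };
    let rack_a2 = Node { id: 2, cxl_fabric: Some(7) };
    let rack_b1 = Node { id: 3, cxl_fabric: None };
    assert_eq!(select_transport(&rack_a1, &rack_a2), CoherenceTransport::Cxl);
    assert_eq!(select_transport(&rack_a1, &rack_b1), CoherenceTransport::Rdma);
    println!("nodes {} and {} fall back to RDMA", rack_a1.id, rack_b1.id);
}
```

The cost of this approach is exactly the one named above: every coherence operation pays a per-pair transport lookup, in exchange for one DSM API across CXL and non-CXL nodes.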
Live Evolution attestation chain (affects §52, §22):
- When a kernel component is hot-swapped via live evolution (Section 52), the TPM measurement chain (Section 22) must be updated to reflect the new component. Decision needed: extend a dedicated PCR (PCR 14?) with the hash of each new component as it is loaded, or re-measure the entire kernel image into a single PCR. The former is simple and incremental but produces a growing measurement log that remote attestors must replay to verify; the latter is cleaner for attestation (a single expected PCR value per known kernel configuration) but requires knowing the full kernel composition at re-measurement time. A related question: should the attestation quote include a manifest of all live-evolved components and their versions, separate from the PCR values?
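The dedicated-PCR option hinges on the TPM extend semantics: the new PCR value is a hash of the old value concatenated with the new measurement, so order matters and the attestor must replay the event log to recompute the expected value. The sketch below uses the standard library's `DefaultHasher` purely as a stand-in for SHA-256 (a real chain uses the TPM's hash bank); component names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// TPM-style extend: new = H(old || measurement). DefaultHasher is a
/// stand-in for SHA-256 here -- illustrative only.
fn extend(pcr: u64, measurement: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pcr.hash(&mut h);         // previous PCR value...
    measurement.hash(&mut h); // ...concatenated with the new measurement
    h.finish()
}

/// A remote attestor replays the shipped event log from the initial
/// PCR value (zero) and compares the result against the quoted PCR.
fn replay(log: &[&[u8]]) -> u64 {
    log.iter().fold(0u64, |pcr, m| extend(pcr, m))
}

fn main() {
    // Kernel side: measure each hot-swapped component before activation.
    let log: [&[u8]; 2] = [b"sched-v2.bin", b"netstack-v3.bin"];
    let mut pcr = 0u64;
    for m in &log {
        pcr = extend(pcr, m);
    }
    // Attestor side: replaying the log must reach the quoted value.
    assert_eq!(replay(&log), pcr);
    // A tampered log (components reordered) yields a different PCR.
    let tampered: [&[u8]; 2] = [b"netstack-v3.bin", b"sched-v2.bin"];
    assert_ne!(replay(&tampered), pcr);
}
```

This illustrates the trade-off named above: the PCR value alone proves nothing without the log, so the log (or the proposed component manifest) becomes part of the attestation evidence and grows with every swap.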
This document is the canonical reference for ISLE development. All implementation decisions must be traceable to the architecture described here. Changes to this document require team review and approval.