Sections 20–21 of the ISLE Architecture. For the full table of contents, see README.md.
Part V: Linux Compatibility
Native POSIX syscall interface, Linux API coverage, and deliberately dropped APIs.
20. Syscall Interface
20.1 Design Goal
ISLE is a POSIX-compatible kernel. Of the roughly 450 defined Linux x86-64
syscalls, about 330-350 are actively used by current software (glibc 2.17+,
musl 1.2+, systemd, Docker, Kubernetes). The remaining ~100-120 are
obsolete and return -ENOSYS unconditionally.
Of the 330-350 active syscalls:
- ~80% (~265-280) are implemented natively with identical POSIX semantics — read,
write, open, mmap, fork, socket, etc. are ISLE's own API, not a translation
layer over something else. The syscall entry point performs representation conversion
(untyped C ABI → typed Rust internals), not semantic translation.
- ~15% (~50-55) need thin adaptation (e.g., Linux's untyped ioctl → ISLE's typed
driver interface).
- ~5% (~15-20) are genuine compatibility shims for deprecated syscalls that get
remapped to modern equivalents.
20.2 Syscall Dispatch Architecture
The SyscallHandler enum classifies every syscall by how it is serviced. The first
three variants (Direct, InnerRingForward, OuterRingForward) are native
implementations — ISLE's own kernel code handling the syscall directly. Only Emulated
is a compatibility shim:
```rust
pub enum SyscallHandler {
    /// Handled directly in ISLE Core -- no tier crossing.
    /// Examples: getpid, brk, mmap, clock_gettime, signals, futex
    Direct(fn(&mut SyscallContext) -> i64),

    /// Forwarded to a Tier 1 driver via domain switch.
    /// Examples: read, write, ioctl, socket ops, mount
    InnerRingForward {
        driver_class: DriverClass,
        handler: fn(&mut SyscallContext) -> i64,
    },

    /// Forwarded to a Tier 2 driver via IPC.
    /// Examples: USB-specific ioctls
    OuterRingForward {
        driver_class: DriverClass,
        handler: fn(&mut SyscallContext) -> i64,
    },

    /// Compatibility shim for deprecated-but-still-called syscalls.
    /// Examples: select (mapped to pselect6), poll (mapped to ppoll)
    Emulated(fn(&mut SyscallContext) -> i64),

    /// Not implemented -- returns -ENOSYS.
    /// Examples: old_stat, socketcall, ipc multiplexer
    Unimplemented,
}
```
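To illustrate how a dispatch table built from this enum is consulted, here is a minimal, self-contained sketch. It keeps only the `Direct`, `Emulated`, and `Unimplemented` variants and uses stand-in `SyscallContext`, handler, and table definitions -- all names and values below are illustrative, not ISLE's actual dispatch code.

```rust
/// Illustrative stand-in for the real syscall context.
pub struct SyscallContext {
    pub nr: usize,
    pub args: [u64; 6],
}

/// Simplified subset of the SyscallHandler enum (sketch only).
#[derive(Clone, Copy)]
pub enum SyscallHandler {
    Direct(fn(&mut SyscallContext) -> i64),
    Emulated(fn(&mut SyscallContext) -> i64),
    Unimplemented,
}

const ENOSYS: i64 = 38; // Linux x86-64 errno value

fn sys_getpid(_ctx: &mut SyscallContext) -> i64 {
    1234 // placeholder pid
}

/// Emulated path: select is serviced by remapping onto the modern
/// pselect6 implementation (sketch).
fn sys_select_shim(ctx: &mut SyscallContext) -> i64 {
    sys_pselect6(ctx)
}

fn sys_pselect6(_ctx: &mut SyscallContext) -> i64 {
    0
}

/// Dispatch: out-of-range or unimplemented syscall numbers return -ENOSYS;
/// native and emulated handlers are invoked uniformly through fn pointers.
pub fn dispatch(table: &[SyscallHandler], ctx: &mut SyscallContext) -> i64 {
    match table.get(ctx.nr).copied().unwrap_or(SyscallHandler::Unimplemented) {
        SyscallHandler::Direct(f) | SyscallHandler::Emulated(f) => f(ctx),
        SyscallHandler::Unimplemented => -ENOSYS,
    }
}
```

The point of the sketch: native and emulated handlers share one calling convention, so the emulation layer is invisible to the dispatch path itself.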
20.3 Virtual Filesystems
These synthetic filesystems are critical for compatibility. Many Linux tools parse them directly and will break if the format is even slightly wrong.
| Filesystem | Implementation | Critical consumers |
|---|---|---|
| `/proc` | Synthetic, generated from kernel state | ps, top, htop, systemd, Docker |
| `/sys` | Reflects device tree from bus manager | udev, systemd, lspci, lsusb |
| `/dev` | Maps to KABI device interfaces | Everything (devtmpfs-compatible) |
| `/dev/shm` | tmpfs shared memory | POSIX shm_open, Chrome, Firefox |
| `/run` | tmpfs | systemd, dbus, PID files |
Key /proc entries that must be pixel-perfect:
/proc/meminfo-- parsed byfree,top, OOM killer/proc/cpuinfo-- parsed by many applications for CPU feature detection/proc/[pid]/maps-- parsed by debuggers, profilers, JVMs/proc/[pid]/status-- parsed byps, container runtimes/proc/[pid]/fd/-- used by lsof, process managers/proc/self/exe-- readlink used by many applications to find themselves/proc/sys/-- sysctl interface for kernel tuning
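As an illustration of the "pixel-perfect" requirement, here is a sketch of rendering `/proc/meminfo` lines. The column widths below approximate Linux's layout; the load-bearing invariants for parsers are the literal field names, the trailing colon, and the `kB` suffix. `render_meminfo` and its field list are illustrative, not ISLE's actual procfs code.

```rust
/// Sketch: render /proc/meminfo in a Linux-compatible shape -- padded field
/// names, right-aligned values, literal " kB" suffix. Tools like `free`
/// split on whitespace, but some pattern-match the exact field names, so
/// the names must match Linux byte-for-byte.
pub fn render_meminfo(fields: &[(&str, u64)]) -> String {
    let mut out = String::new();
    for (name, kib) in fields {
        // e.g. "MemTotal:        16316932 kB"
        out.push_str(&format!("{:<15} {:>8} kB\n", format!("{}:", name), kib));
    }
    out
}
```

A consumer such as `free` reads the whole file and keys on the field names, so adding new fields is safe but renaming or reordering semantics is not.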
20.4 Complete Feature Coverage
These features must be designed into the architecture from day one. They cannot be bolted on later.
eBPF Subsystem
- Full eBPF virtual machine (register-based, 11 registers, 64-bit)
- Verifier: static analysis ensuring program safety (bounded loops, memory safety, no uninitialized reads)
- JIT compiler: eBPF bytecode to native code, per architecture:
- x86-64: Phase 2 (primary development target)
- AArch64: Phase 3 (A64 instruction emission)
- ARMv7: Phase 4 (Thumb-2 instruction emission)
- RISC-V 64: Phase 3 (RV64 instruction emission)
- PPC32: Phase 4 (PPC32 instruction emission)
- PPC64LE: Phase 3 (PPC64 instruction emission)
- Interpreted fallback available on all architectures from Phase 1
- Program types: XDP, tc (traffic control), kprobe, tracepoint, cgroup, socket filter, LSM, struct_ops
- Map types: hash, array, ringbuf, per-CPU hash, per-CPU array, LRU hash, LPM trie, queue, stack
- bpftool compatibility for loading and inspecting programs
- Required for: bpftrace, Cilium (Kubernetes networking), Falco (security), BCC tools
Relationship to KABI policy hooks: eBPF provides Linux-compatible user-to-kernel extensibility for tracing, networking, and security (the same role as in Linux). ISLE's KABI driver model (Section 5) supports kernel-internal extensibility via vtable-based driver interfaces — drivers can register policy callbacks for scheduling, memory, and I/O decisions. The two mechanisms are complementary: eBPF serves the Linux ecosystem (existing tools, user-authored programs); KABI serves kernel evolution (vendor-provided policy drivers, hardware-specific optimizations).
eBPF Verifier Architecture
The verifier is the highest-risk component in the syscall interface. ISLE implements a clean-room Rust reimplementation (not a port of Linux's C verifier), leveraging Rust's type system to make verifier invariants compile-time enforced where possible.
Abstract interpretation: Forward dataflow analysis tracking register types and value ranges through every reachable instruction. At branch points, both paths are explored. At join points, register states are merged conservatively (widening).
Register type system: Each eBPF register carries a tracked type:
| Type | Meaning | Allowed Operations |
|---|---|---|
| `SCALAR_VALUE` | Unknown integer with tracked min/max bounds | Arithmetic, comparison |
| `PTR_TO_MAP_VALUE` | Pointer into a BPF map value | Read/write within value size |
| `PTR_TO_CTX` | Pointer to program context | Field access at known offsets |
| `PTR_TO_STACK` | Pointer to 512-byte stack | Read/write within stack bounds |
| `PTR_TO_PACKET` | Network packet data pointer | Read after bounds check |
| `PTR_TO_BTF_ID` | Typed kernel pointer (BTF-guided) | Field access per BTF layout |
Bounds checking: Every pointer arithmetic updates the tracked offset range. Before any memory load/store, the verifier proves bounds statically — no runtime checks inserted.
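The static proof obligation can be made concrete with a small model. This is an illustrative sketch of range tracking, not ISLE's verifier: `PtrBounds` stands in for the tracked state of one pointer register, and the key property is that a load is accepted only if the *worst case* in the tracked range is in bounds, since no runtime check will back it up.

```rust
/// Illustrative model of verifier bounds tracking for a pointer register.
/// The pointer's possible offset into an object of known size is tracked
/// as a closed range [min, max].
#[derive(Clone, Copy)]
pub struct PtrBounds {
    pub min: i64,      // smallest possible offset
    pub max: i64,      // largest possible offset
    pub obj_size: i64, // size of the pointed-to object (e.g. map value size)
}

impl PtrBounds {
    /// Pointer arithmetic: adding a scalar whose tracked range is
    /// [smin, smax] widens the pointer's offset range accordingly.
    pub fn add_scalar(self, smin: i64, smax: i64) -> PtrBounds {
        PtrBounds { min: self.min + smin, max: self.max + smax, ..self }
    }

    /// Can a `width`-byte load through this pointer be proven safe?
    /// Must hold for EVERY value in the range -- the verifier inserts no
    /// runtime check, so the worst case must already be in bounds.
    pub fn load_ok(self, width: i64) -> bool {
        self.min >= 0 && self.max + width <= self.obj_size
    }
}
```

This is why programs must bounds-check before packet access: the comparison narrows the tracked `max`, turning a rejectable load into a provable one.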
Loop handling: Bounded loops (Linux 5.3+ semantics) supported. Maximum verifier
instruction exploration count: 1 million (BPF_COMPLEXITY_LIMIT_INSNS, matching Linux
since kernel 5.2). Maximum program size: 4,096 instructions for unprivileged programs
(BPF_MAXINSNS), 1 million for privileged. These are separate limits: the exploration
count covers all paths the verifier must check, while the program size is the actual
instruction count. Unbounded loops rejected.
Stack depth: Maximum 512 bytes per frame, verified statically. BPF-to-BPF call depth
max 8 frames, each with up to 512 bytes of stack. Tail call chain depth max 33
(MAX_TAIL_CALL_CNT, matching Linux 5.12+; earlier kernels used 32).
eBPF Verifier Risk Mitigation
A verifier bug equals kernel compromise. ISLE applies defense-in-depth:
- Isolation domain: eBPF programs execute in a Tier 1 isolation domain. On x86-64, this uses one of the 12 Tier 1 PKEYs (2-13) — BPF programs share PKEY allocation with drivers, reducing the maximum simultaneous Tier 1 drivers by one when BPF programs are loaded. On other architectures (AArch64, ARMv7, RISC-V, PPC), BPF uses page-table-based isolation with no key limit. Even with a verifier bug, BPF code cannot access isle-core state (PKEY 0) directly.
- Capability-gated loading: Only `CAP_BPF` holders can load programs. Unprivileged eBPF loading disabled by default.
- Differential testing: ISLE verifier tested against Linux verifier on >50,000 known-good and known-bad programs. Any divergence is investigated.
- Rust type safety: Invalid state transitions are compile-time errors, not runtime checks.
BPF Isolation Model
BPF programs are a cross-cutting concern used beyond networking: tracing (kprobe, tracepoint), security (LSM, seccomp), scheduling (struct_ops), and packet filtering (XDP, tc) all execute BPF code. The full BPF isolation model — domain confinement, map access control, capability-gated helpers, cross-domain packet redirect rules, and verifier enforcement — is specified in Section 35.2 (Packet Filtering, BPF-Based). Although Section 35.2 is located in the Networking part, its isolation rules apply to all BPF program types, not just networking hooks. Every BPF program, regardless of attachment point, executes in a dedicated BPF isolation domain (separate from isle-core and the subsystem that loaded it) and accesses kernel state only through verified BPF helpers that perform cross-domain access on the program's behalf.
KVM Hypervisor
- Implemented as a Tier 1 driver exposing the `/dev/kvm` interface
- Full x86-64 VMX support:
- Nested paging (EPT)
- VMCS shadowing (for nested virtualization)
- Posted interrupts (for efficient interrupt delivery)
- PML (Page Modification Logging)
- QEMU/KVM, libvirt, Firecracker, Cloud Hypervisor must work unmodified
ARM64 KVM (VHE/nVHE):
ARM64 KVM uses the Virtualization Extensions (ARMv8.1+). Two modes are supported:
VHE (Virtualization Host Extensions, ARMv8.1+):
- Host kernel runs at EL2 (hypervisor exception level) instead of EL1.
- Guest runs at EL1 (virtual EL1, translated by VHE).
- Benefit: no world switch needed for host kernel — host IS the hypervisor.
- VTTBR_EL2 points to guest's Stage-2 translation tables.
- Guest physical → host physical translation via Stage-2 page tables.
- Used on: AWS Graviton, Ampere, Apple Silicon, Cortex-X series.
nVHE (non-VHE, pre-ARMv8.1 or when VHE is disabled):
- Host kernel runs at EL1. Hypervisor stub at EL2.
- Guest entry requires EL1 → EL2 → EL1(guest) transition.
- Higher overhead (~500-1000 cycles per VM entry/exit vs ~200 for VHE).
- ISLE supports nVHE for older ARM64 hardware but defaults to VHE.
Protected KVM (pKVM, ARMv8.0+):
- EL2 hypervisor is a small, deprivileged module (~5K lines).
- Host kernel runs at EL1 with restricted Stage-2 mappings.
- Guest memory is inaccessible to the host (confidential VMs without TEE).
- Aligns with ISLE's isolation model: pKVM enforces VM isolation in hardware.
ARM64 KVM integration with ISLE isolation:
- On ARM64, the isolation mechanism is POE/page-table (not MPK). KVM uses a
Stage-2 trampoline analogous to the x86 VMX trampoline: isle-core manages
VTTBR_EL2 and HCR_EL2 writes; isle-kvm prepares the VM configuration in its own
isolation domain. The trampoline validates Stage-2 page tables before executing
the ERET to enter the guest.
- PSCI (Power State Coordination Interface) for vCPU bring-up: KVM intercepts PSCI
calls from the guest via HVC/SMC trapping in HCR_EL2.
- Virtual GIC (vGICv3/vGICv4): Interrupt injection uses GICv4 direct injection where
available (zero exit for most interrupts), falling back to software injection.
RISC-V KVM (H-extension):
RISC-V virtualization is defined by the H (Hypervisor) extension (ratified December 2021, as part of Privileged Architecture v1.12):
H-extension architecture:
- Hypervisor runs in HS-mode (Hypervisor-extended Supervisor mode).
- Guest runs in VS-mode (Virtual Supervisor mode).
- hstatus CSR: hypervisor status (SPV bit tracks guest/host context).
- hgatp CSR: guest physical → host physical address translation
(analogous to EPT on x86 and Stage-2 on ARM).
- htval CSR: faulting guest physical address (for #PF handling).
- hvip/hip/hie CSRs: virtual interrupt injection.
- Guest trap delegation: hedeleg/hideleg CSRs control which traps
go to VS-mode (guest handles) vs HS-mode (hypervisor handles).
VM entry/exit:
- Entry: set hstatus.SPV = 1, sret → enters VS-mode.
- Exit: guest trap/interrupt → HS-mode handler (automatic by hardware).
- Cost: ~200-400 cycles per exit (varies by implementation).
IOMMU: RISC-V IOMMU spec (ratified June 2023) provides Stage-2 translation
for device DMA, analogous to Intel VT-d / ARM SMMU.
RISC-V KVM integration with ISLE:
- The isle-kvm driver manages hgatp (guest page tables) and hvip (virtual interrupts) in its isolation domain. The HS-mode trampoline validates hgatp entries before guest entry.
- H-extension hardware is available on SiFive P670, T-Head C910, and QEMU virt. ISLE targets QEMU for initial development.
KVM and Domain Isolation — KVM presents a unique challenge for the Tier 1 model.
Unlike a NIC or storage driver that accesses a single device via MMIO, KVM requires:
(1) VMX root mode transitions (VMXON, VMLAUNCH, VMRESUME), which are privileged
Ring 0 operations that affect global CPU state; (2) VMCS manipulation, which Intel
requires to be in a specific memory region pointed to by a per-CPU VMCS pointer;
(3) EPT (Extended Page Table) management, which programs second-level page tables
that control guest physical-to-host physical address translation; (4) direct access
to MSRs and control registers during VM entry/exit.
A standard isolation domain cannot provide these capabilities — WRPKRU controls memory
access permissions, not instruction execution privilege. KVM is therefore designated
as a Tier 0.5 driver: it runs with its own isolation domain (like Tier 1) but is
additionally granted CAP_VMX — a capability that allows isle-core to execute VMX
operations on KVM's behalf via a VMX trampoline. The trampoline runs in PKEY 0
(ISLE Core domain) and performs the actual VMLAUNCH/VMRESUME. KVM prepares the
VMCS and EPT in its own isolation domain; the trampoline validates the VMCS fields (no
host-state corruption, EPT doesn't map ISLE Core pages writable to the guest), then
executes the VM entry.
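One of the trampoline's checks can be sketched concretely: before VM entry, reject any EPT leaf that maps an ISLE Core frame writable to the guest. The `EptLeaf` type, the flat leaf list, and the address range are illustrative placeholders -- real EPT validation walks hardware page-table structures.

```rust
/// Illustrative EPT leaf entry: the host physical address it maps and
/// whether the guest may write through it.
#[derive(Clone, Copy)]
pub struct EptLeaf {
    pub hpa: u64,
    pub writable: bool,
}

/// Trampoline check (sketch): no guest-writable leaf may land inside
/// isle-core's physical frame range. Read-only aliases are tolerated here;
/// a stricter policy could reject those too.
pub fn ept_leaves_ok(leaves: &[EptLeaf], core_frames: std::ops::Range<u64>) -> bool {
    leaves
        .iter()
        .all(|l| !(l.writable && core_frames.contains(&l.hpa)))
}
```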
Why not Tier 0? — Tier 0 code cannot crash-recover. By keeping KVM as Tier 0.5 (domain-isolated with a validated trampoline), a bug in KVM's VMCS preparation or ioctl handling crashes only KVM, not ISLE Core. The VMX trampoline itself is ~200 lines of verified assembly — small enough to audit as Tier 0 code.
Recovery implications — When isle-kvm crashes, all running VMs are paused (their vCPU threads are halted). After isle-kvm reloads (~150ms, FLR path for any assigned devices), the VMCS state for each VM is reconstructed from the checkpointed state buffer (Section 9). VMs resume without guest-visible interruption beyond a brief pause. If reconstruction fails, the VM is terminated (same outcome as a host kernel crash in Linux, but without affecting other VMs or the host).
KVM Integration with isle-core Memory Management:
KVM's Extended Page Tables (EPT on x86, Stage-2 on ARM, hgatp on RISC-V) require tight integration with isle-core's memory management subsystem (Section 12):
Second-Level Address Translation (SLAT) hooks:
```rust
/// isle-core provides these hooks to isle-kvm for EPT/Stage-2 management.
/// Each hook operates on host physical frames and guest physical addresses.
pub trait SlatHooks {
    /// Allocate a physical page for SLAT page table structures (EPT/Stage-2/hgatp
    /// page table entries). These are hypervisor metadata pages used to build the
    /// second-level address translation tables -- NOT guest physical memory backing
    /// pages. Returns a pinned frame suitable for use as a page table page.
    fn alloc_slat_page(&self) -> Result<PhysFrame, KernelError>;

    /// Free a SLAT page table structure page previously allocated by
    /// `alloc_slat_page`.
    fn free_slat_page(&self, frame: PhysFrame);

    /// Allocate a physical page to back guest physical memory. This is the host
    /// physical frame that the guest will use as RAM -- mapped into the SLAT tables
    /// as a leaf entry. Distinct from `alloc_slat_page`, which allocates page table
    /// structure pages (internal SLAT nodes).
    fn alloc_guest_page(&self) -> Result<PhysFrame, KernelError>;

    /// Free a guest physical memory backing page previously allocated by
    /// `alloc_guest_page`, returning it to isle-core's buddy allocator.
    fn free_guest_page(&self, frame: PhysFrame);

    /// Pin a host physical page to prevent reclaim or migration while it is
    /// mapped in an EPT/Stage-2 table. The page remains pinned until the
    /// corresponding `unpin_host_page` call.
    fn pin_host_page(&self, frame: PhysFrame) -> Result<(), KernelError>;

    /// Unpin a host physical page, allowing isle-core to reclaim or migrate it.
    fn unpin_host_page(&self, frame: PhysFrame);

    /// Notify isle-core that a guest physical to host physical mapping was created.
    /// Used for dirty page tracking and live migration bookkeeping.
    fn notify_slat_map(&self, gpa: u64, hpa: u64, size: usize, writable: bool);

    /// Notify isle-core that a SLAT mapping was removed.
    fn notify_slat_unmap(&self, gpa: u64, size: usize);
}
```
Memory overcommit: isle-kvm can overcommit guest memory (assign more virtual memory to VMs than is physically available). When a guest accesses an unmapped guest physical page, the EPT violation is handled through a five-step path:
1. VM exit to trampoline: The EPT/Stage-2/hgatp violation triggers a VM exit. The VMX trampoline (running in PKEY 0/isle-core) captures the faulting guest physical address from VMCS (x86), FAR_EL2 (ARM), or htval (RISC-V).
2. Synchronous upcall to isle-kvm: The trampoline performs a direct function call (not ring buffer IPC) to isle-kvm's page fault handler. This is safe because:
   - The call is synchronous within the vCPU thread context (no concurrency with other isle-kvm operations on this vCPU).
   - isle-kvm's page fault handler runs in its isolation domain but accesses only its own per-VM data structures.
   - The trampoline validates the fault is a legitimate EPT violation (not a malicious call from compromised code) before invoking isle-kvm.
   Direct call latency: ~30-50 cycles (domain switch + indirect call), NOT the ~200+ cycle ring buffer round-trip used for asynchronous driver IPC.
3. Page request: isle-kvm requests a guest backing page from isle-core via `SlatHooks::alloc_guest_page` (another direct call; isle-core is PKEY 0).
4. Page allocation: isle-core allocates from the buddy allocator, potentially reclaiming pages from page cache, compressing cold pages (Section 13), or evicting pages from other guests based on the memory pressure framework.
5. Mapping and resume: isle-kvm installs the EPT/Stage-2 mapping in its per-VM page tables and returns to the trampoline, which resumes the guest via VMRESUME/ERET.
Total EPT violation latency: ~200 cycles (VM exit) + ~50 cycles (trampoline + domain switch) + ~100-500 cycles (page allocation, varies by pressure) + ~200 cycles (VM entry) = ~550-950 cycles for a page-in from free list. This is comparable to Linux KVM's EPT violation handling (~400-800 cycles on similar hardware).
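The page-fault steps of this path can be sketched from isle-kvm's side. This is a pared-down stand-in: the trait below keeps only the two SlatHooks methods the path needs (and takes `&mut self` so the mock can record calls); `PhysFrame`, `MockCore`, and `handle_slat_violation` are illustrative, and the real SLAT is a hardware page-table structure, not a BTreeMap.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct PhysFrame(pub u64);

#[derive(Debug)]
pub enum KernelError {
    OutOfMemory,
}

/// Pared-down subset of the SlatHooks trait for this sketch.
pub trait SlatHooks {
    fn alloc_guest_page(&mut self) -> Result<PhysFrame, KernelError>;
    fn notify_slat_map(&mut self, gpa: u64, hpa: u64, size: usize, writable: bool);
}

/// Steps 3-5 of the violation path: request a backing page from isle-core,
/// install the leaf mapping for the faulting 4 KiB guest page, notify core.
pub fn handle_slat_violation(
    core: &mut impl SlatHooks,
    slat: &mut std::collections::BTreeMap<u64, PhysFrame>,
    faulting_gpa: u64,
) -> Result<(), KernelError> {
    let gpa_page = faulting_gpa & !0xFFF; // 4 KiB-align the faulting address
    let frame = core.alloc_guest_page()?; // steps 3-4: isle-core allocates
    slat.insert(gpa_page, frame);         // step 5: install the leaf mapping
    core.notify_slat_map(gpa_page, frame.0, 4096, true);
    Ok(())
}

/// Test stand-in for isle-core's allocator (hands out sequential frames
/// and records notify calls).
pub struct MockCore {
    pub next: u64,
    pub maps: Vec<(u64, u64)>,
}

impl SlatHooks for MockCore {
    fn alloc_guest_page(&mut self) -> Result<PhysFrame, KernelError> {
        self.next += 0x1000;
        Ok(PhysFrame(self.next))
    }
    fn notify_slat_map(&mut self, gpa: u64, hpa: u64, _size: usize, _w: bool) {
        self.maps.push((gpa, hpa));
    }
}
```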
Dirty page tracking for live migration uses architecture-specific mechanisms:
- PML (Page Modification Logging) on Intel: hardware logs dirty guest physical addresses to a 512-entry buffer in the VMCS. When the buffer fills, a VM exit occurs and isle-kvm drains the buffer into a per-VM dirty bitmap.
- Software dirty tracking on ARM/RISC-V: isle-kvm clears the write permission bit in Stage-2/hgatp entries. Write faults trap into isle-kvm, which records the dirty page in the bitmap and restores write permission. Batched permission restoration amortizes the TLB invalidation cost.
- isle-core maintains per-VM dirty bitmaps (one bit per 4 KiB page) that can be queried and atomically reset by the migration coordinator.
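A per-VM dirty bitmap of this shape can be sketched with atomic words, so the vCPU fault/PML-drain path (setting bits) and the migration coordinator (fetch-and-clear) never contend on a lock. The `DirtyBitmap` type and its method names are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of a per-VM dirty bitmap: one bit per 4 KiB guest page, packed
/// into atomic 64-bit words.
pub struct DirtyBitmap {
    words: Vec<AtomicU64>,
}

impl DirtyBitmap {
    pub fn new(num_pages: usize) -> Self {
        let n_words = (num_pages + 63) / 64;
        DirtyBitmap { words: (0..n_words).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Called from the write-fault or PML-drain path: mark the page
    /// containing `gpa` dirty.
    pub fn mark_dirty(&self, gpa: u64) {
        let page = (gpa >> 12) as usize;
        self.words[page / 64].fetch_or(1 << (page % 64), Ordering::Relaxed);
    }

    /// Migration coordinator: atomically read and reset one word
    /// (64 pages' worth of dirty bits) in a single swap.
    pub fn drain_word(&self, idx: usize) -> u64 {
        self.words[idx].swap(0, Ordering::Relaxed)
    }
}
```

The swap-to-zero is what makes "queried and atomically reset" a single operation: a write that races with the drain lands either in the drained snapshot or in the fresh word, but is never lost.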
Ballooning integration: The virtio-balloon driver in the guest inflates (returns
pages to the host) or deflates (reclaims pages from the host). isle-kvm processes
balloon requests by calling free_guest_page on inflation (returning the host physical
frame to isle-core's buddy allocator) and alloc_guest_page on deflation (allocating a
new guest backing frame and installing the EPT mapping). Balloon state is included in the isle-kvm
checkpoint for crash recovery (Section 9).
Netfilter / nftables
- Tier 1 network stack includes the nftables packet classification engine
- iptables legacy compatibility via the nft backend (same approach as modern Linux)
- Connection tracking (conntrack) for stateful firewalling
- NAT support: SNAT, DNAT, masquerade
- Required for: Docker networking, Kubernetes kube-proxy (iptables mode), firewalld
Linux Security Modules (LSM)
- LSM hook framework at all security-relevant points (file access, socket operations, task operations, IPC, etc.)
- SELinux policy engine compatibility (required for RHEL/CentOS/Fedora)
- AppArmor profile compatibility (required for Ubuntu/SUSE)
- Capability-based hooks integrate naturally with ISLE's native capability model
- seccomp-bpf for per-process syscall filtering (required for Docker, Chrome)
Namespaces
All 8 Linux namespace types:
| Namespace | Purpose | Required for |
|---|---|---|
| `mnt` | Mount point isolation | Containers, chroot |
| `pid` | Process ID isolation | Containers |
| `net` | Network stack isolation | Containers, VPN |
| `ipc` | IPC resource isolation | Containers |
| `uts` | Hostname/domainname isolation | Containers |
| `user` | UID/GID mapping | Rootless containers |
| `cgroup` | Cgroup hierarchy isolation | Containers |
| `time` | Clock offset isolation | Containers |
Cgroups
- cgroup v2 as primary implementation (unified hierarchy)
- cgroup v1 compatibility mode (required for older Docker, systemd < 248)
- Controllers: cpu, cpuset, memory, io, pids, rdma, hugetlb, misc
- Required for: systemd resource management, Docker, Kubernetes, OOM handling
20.5 io_uring Compatibility
Full io_uring support with a security enhancement:
- Same SQE/CQE ring buffer ABI (binary compatible)
- Same opcodes: `IORING_OP_READ`, `WRITE`, `READV`, `WRITEV`, `FSYNC`, `POLL_ADD`, `ACCEPT`, `CONNECT`, `SEND`, `RECV`, `OPENAT`, `CLOSE`, `STATX`, `PROVIDE_BUFFERS`, etc.
- SQPOLL mode (kernel-side submission polling)
- Registered buffers and registered files (pre-pinned for zero-copy)
- Fixed files for reduced file descriptor overhead
Advanced io_uring features:
- Multishot operations (`IORING_POLL_ADD_MULTI`, multishot `accept`, multishot `recv`): single SQE generates multiple CQEs, reducing submission overhead for event-driven servers.
- Cancellation (`IORING_OP_ASYNC_CANCEL`): cancel in-flight operations by user_data tag.
- Linked SQEs (`IOSQE_IO_LINK`): ordered execution chains.
- `IORING_OP_URING_CMD` (passthrough): driver-specific commands via io_uring. NVMe passthrough (`nvme_uring_cmd`) works through this path. ISLE routes uring_cmd to the KABI driver's command handler, maintaining the same `struct nvme_uring_cmd` ABI.
- `IORING_REGISTER_RING_FD`: ring self-reference for reduced fd overhead.
- `IORING_OP_SEND_ZC` / `RECV_ZC`: zero-copy network I/O.
Security improvement over Linux: Per-instance operation whitelist via capabilities.
In Linux, io_uring bypasses syscall-level security monitoring (seccomp, audit, ptrace).
ISLE allows administrators to restrict which io_uring opcodes are available to
each process, addressing this known security gap. The whitelist applies to both standard
opcodes and URING_CMD subtypes — an io_uring instance can be restricted to, e.g.,
read/write only, with NVMe passthrough blocked.
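A per-instance whitelist of this kind reduces to a fixed-size bitset checked once per SQE at submission time. The sketch below is illustrative: the `OpcodeWhitelist` type is an assumption, and the opcode numbers in the test are placeholders, not the real `IORING_OP_*` values.

```rust
/// Sketch of a per-io_uring-instance opcode whitelist: a 256-bit set,
/// one bit per opcode, default deny.
pub struct OpcodeWhitelist {
    bits: [u64; 4],
}

impl OpcodeWhitelist {
    pub fn deny_all() -> Self {
        OpcodeWhitelist { bits: [0; 4] }
    }

    /// Administrator grants an opcode to this instance.
    pub fn allow(&mut self, op: u8) {
        self.bits[(op / 64) as usize] |= 1 << (op % 64);
    }

    /// Checked once per SQE, before the opcode handler runs. A denied
    /// opcode fails the SQE instead of silently bypassing policy.
    pub fn permits(&self, op: u8) -> bool {
        self.bits[(op / 64) as usize] & (1 << (op % 64)) != 0
    }
}
```

Because the check sits on the submission path rather than the syscall boundary, it covers operations that never appear as syscalls -- precisely the gap seccomp cannot see.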
20.6 Signal Handling
Full POSIX and Linux signal semantics:
- 64 signals: signals 1-31 (standard) and signals 32-64 (real-time)
- `sigaction` with `SA_SIGINFO`, `SA_RESTART`, `SA_NOCLDSTOP`, `SA_ONSTACK`
- `sigaltstack` for alternate signal stacks
- Per-thread signal masks (`pthread_sigmask`)
- Signal delivery by modifying saved register state on the user stack (same mechanism as Linux -- required for correct `sigreturn`)
- Proper interaction with: io_uring (signal-driven completion), epoll (`EINTR` semantics), futex (interrupted waits), nanosleep (remaining time)
- `signalfd` for synchronous signal consumption
- Process groups and session signals (`SIGHUP`, `SIGCONT`, `SIGSTOP`)
20.7 cgroups: v2 Native with v1 Compatibility Shim
Linux problem: cgroups v1 had a messy, inconsistent design with separate hierarchies for each controller. v2 fixed this but migration was painful.
ISLE design:
- cgroups v2 only as the native implementation. Single unified hierarchy.
- Thin v1 compatibility shim: for container runtimes and tools that still use v1 filesystem paths, provide a v1-compatible view that maps to the v2 backend. This is read/write for the common operations (cpu, memory, io, pids) and read-only/unsupported for obscure v1-only features.
- Pressure Stall Information (PSI): built into cgroup v2 from the start (not added years later as in Linux).
Resource Controllers (Detailed)
| Controller | Function | Key Tunables |
|---|---|---|
| `cpu` | CPU bandwidth limiting and proportional sharing | cpu.max, cpu.weight |
| `cpuset` | CPU and memory node pinning | cpuset.cpus, cpuset.mems |
| `memory` | Memory usage limits and OOM control | memory.max, memory.high, memory.low |
| `io` | Block I/O bandwidth and IOPS limiting | io.max, io.weight |
| `pids` | Process/thread count limit | pids.max |
ISLE-specific controllers: `accel` (Section 44.2) and `power` (Section 49.3) follow the same v2 interface conventions.
Delegation Model
Non-root processes can manage sub-hierarchies with CAP_CGROUP_ADMIN, which can be
scoped to a specific subtree via the capability system (Section 11). A container runtime
holding CAP_CGROUP_ADMIN(subtree=/sys/fs/cgroup/containers/pod-xyz) can manage cgroups
under that path but cannot touch anything outside it.
Pressure Stall Information (PSI)
Each cgroup exposes pressure metrics (cpu.pressure, memory.pressure, io.pressure)
with 10s/60s/300s averages. PSI supports real-time event notification via poll/epoll
triggers. Orchestrators (kubelet, systemd-oomd) use PSI to detect resource saturation
before hard limits are hit.
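Windowed averages of this kind are typically maintained by exponential decay rather than by storing samples. The sketch below folds one sampling period into 10s/60s/300s averages with the standard EMA update; the `PsiAvg` type, the 2-second period, and the decay constants are illustrative, not Linux's exact PSI implementation.

```rust
/// Sketch: PSI-style running averages of the stalled-time fraction.
pub struct PsiAvg {
    pub avg10: f64,
    pub avg60: f64,
    pub avg300: f64,
}

impl PsiAvg {
    /// Fold in one 2-second sampling period, given the fraction of that
    /// period (0.0..=1.0) during which some tasks were stalled.
    /// EMA update: avg <- avg * e^(-dt/window) + stalled * (1 - e^(-dt/window))
    pub fn update(&mut self, stalled: f64) {
        let windows = [
            (&mut self.avg10, 10.0_f64),
            (&mut self.avg60, 60.0),
            (&mut self.avg300, 300.0),
        ];
        for (avg, window) in windows {
            let decay = (-2.0_f64 / window).exp();
            *avg = *avg * decay + stalled * (1.0 - decay);
        }
    }
}
```

The three windows give orchestrators a gradient: a spike shows first in avg10, while sustained saturation slowly lifts avg300, which is why saturation can be detected before hard limits trip.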
20.8 Event Notification (epoll, poll, select)
Linux applications use three generations of event notification. ISLE implements all three for compatibility but steers new applications toward io_uring (Section 20.5).
epoll (Primary)
Full implementation:
- Syscalls: `epoll_create1`, `epoll_ctl` (ADD/MOD/DEL), `epoll_wait`, `epoll_pwait`, `epoll_pwait2`.
- Trigger modes: edge-triggered (`EPOLLET`) and level-triggered (default).
- Flags: `EPOLLONESHOT` (auto-disarm), `EPOLLEXCLUSIVE` (one waiter per event).
- Internal structure: red-black tree for monitored fds, ready list for events. `epoll_wait` drains the ready list -- no scanning of all monitored fds.
- Nested epoll: nesting depth capped at 5 (matching Linux).
```rust
/// Per-epoll-instance state.
pub struct EpollInstance {
    /// Red-black tree of monitored file descriptors.
    pub interests: RBTree<EpollKey, EpollItem>,
    /// Ready list: fds with pending events.
    pub ready_list: LinkedList<EpollItem>,
    /// Wait queue for threads blocked in epoll_wait.
    pub waiters: WaitQueue,
}
```
poll and select (Legacy)
- `poll`: array of `struct pollfd`, O(n) per call. No persistent kernel state.
- `select`: bitmap-based, limited to 1024 fds. O(n) scan. POSIX compatibility only.
- `ppoll` / `pselect`: signal-mask-aware variants.
Event-Oriented File Descriptors
- `eventfd`: lightweight inter-thread notification via `u64` counter.
- `signalfd`: synchronous signal consumption as fd readability.
- `timerfd`: timer expiry as fd readability, backed by hrtimer infrastructure.
Relationship to io_uring
io_uring (Section 20.5) supersedes epoll for new high-performance applications.
IORING_OP_POLL_ADD provides the same notification within io_uring's unified model.
20a. Futex and Userspace Synchronization
20a.1 Futex Implementation
The futex(2) syscall is the kernel-side primitive underlying all userspace synchronization:
glibc pthread_mutex_lock, pthread_cond_wait, sem_wait, and C++ std::mutex all
compile down to futex operations. Understanding futex is essential because the fast path
never enters the kernel at all -- an uncontended lock is a single atomic compare-and-swap
on a shared memory word, entirely in userspace. The kernel is only involved when a thread
must sleep (FUTEX_WAIT) or wake sleeping threads (FUTEX_WAKE).
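The fast path described above can be made concrete with the classic three-state futex mutex. This is a userspace-side sketch: the `futex_wait`/`futex_wake` functions below are no-op stubs standing in for the syscall, and the state encoding (0 = unlocked, 1 = locked, 2 = contended) is the well-known scheme, not necessarily glibc's exact one.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;
const CONTENDED: u32 = 2;

/// Returns true if the fast path was taken (no kernel entry at all).
pub fn lock(word: &AtomicU32) -> bool {
    // Fast path: a single CAS on the shared word, entirely in userspace.
    if word
        .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
    {
        return true;
    }
    // Slow path: mark the lock contended, then sleep in the kernel until
    // the word changes. FUTEX_WAIT re-checks *word == CONTENDED under the
    // bucket lock, so a wakeup between our swap and the syscall is not lost.
    while word.swap(CONTENDED, Ordering::Acquire) != UNLOCKED {
        futex_wait(word, CONTENDED);
    }
    false
}

pub fn unlock(word: &AtomicU32) {
    // Only enter the kernel if someone might be sleeping.
    if word.swap(UNLOCKED, Ordering::Release) == CONTENDED {
        futex_wake(word, 1);
    }
}

fn futex_wait(_word: &AtomicU32, _expected: u32) {
    // Stub for the sketch: the real call is futex(uaddr, FUTEX_WAIT, val, ...).
}

fn futex_wake(_word: &AtomicU32, _n: u32) {
    // Stub for the sketch: the real call is futex(uaddr, FUTEX_WAKE, n, ...).
}
```

Note that `unlock` of an uncontended lock also never enters the kernel: both halves of the common case are a single atomic each.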
ISLE implements the following futex operations:
| Operation | Description |
|---|---|
| FUTEX_WAIT | Block if *uaddr == val (avoids lost-wakeup race) |
| FUTEX_WAKE | Wake up to N waiters on uaddr |
| FUTEX_WAIT_BITSET | WAIT with 32-bit bitmask for selective wakeup |
| FUTEX_WAKE_BITSET | WAKE with bitmask (only wake waiters whose mask overlaps) |
| FUTEX_REQUEUE | Move waiters from one futex to another (condition variables) |
| FUTEX_CMP_REQUEUE | Requeue with value check (prevents lost wakeups during cond broadcast) |
| FUTEX_WAKE_OP | Atomic wake + modify (optimizes pthread_cond_signal + mutex_unlock) |
The futex wait queue is organized as a hash table keyed by (address_space_id, virtual_address).
Each bucket contains a linked list of waiting tasks:
/// Futex hash key. Combines a key kind with an offset to uniquely identify a futex.
///
/// For **private futexes** (the common case, ~99% of mutex uses): the key is
/// (mm_id, page-aligned vaddr, offset within page). The `offset` field is
/// redundant with vaddr's low bits but kept for uniformity with the shared case.
///
/// For **shared futexes** (MAP_SHARED): the key is (physical page frame, offset
/// within page). Both processes sharing the mapping hash to the same bucket and
/// match on the same (PhysFrame, offset) pair, even if their virtual addresses differ.
///
/// **Matching rule**: Two FutexKeys match iff (kind == kind) AND (offset == offset).
/// For Private, kind equality means same mm_id and same vaddr. For Shared, kind
/// equality means same PhysFrame. The offset is ALWAYS part of the match.
pub struct FutexKey {
kind: FutexKeyKind,
/// Offset within the 4K page (0..4095). For private futexes, this equals
/// (vaddr & 0xFFF). For shared futexes, this is the offset into the physical
/// page. Critical for correctness: multiple futexes on the same page must NOT
/// collide (they have different offsets).
offset: u32,
}
pub enum FutexKeyKind {
/// Private mapping: keyed by (address space, page-aligned virtual address).
/// The offset field in FutexKey provides the intra-page position.
Private { mm_id: MmId, vaddr: VirtAddr },
/// Shared mapping: keyed by physical page frame.
/// The offset field in FutexKey provides the intra-page position.
/// This ensures processes mapping the same file/shm at different virtual
/// addresses still wake each other correctly.
Shared { page: PhysFrame },
}
/// A futex waiter node, embedded in the Task struct (Section 12a.1).
/// Uses intrusive linking to avoid heap allocation under spinlock.
/// A task can wait on at most one futex at a time (futex_wait is blocking),
/// so a single embedded FutexWaiter per task is sufficient.
pub struct FutexWaiter {
/// Intrusive list link (prev/next pointers for the bucket's waiter list).
pub link: IntrusiveLink,
/// The futex key this waiter is blocked on (for requeue and wake filtering).
pub key: FutexKey,
/// Bitset for FUTEX_WAIT_BITSET selective wakeup (0xFFFF_FFFF = match all).
pub bitset: u32,
/// Back-pointer to the owning Task (for wake-up scheduling).
pub task: *const Task,
}
/// Each bucket is protected by its own spinlock -- contention is spread
/// across the table rather than funneled through a single lock.
///
/// Waiter lists use intrusive linked lists (not `Vec`) to avoid heap
/// allocation under spinlock. FutexWaiter nodes are embedded in the
/// task struct (Section 12a.1, `futex_waiter` field) and linked/unlinked
/// in O(1) with no allocator interaction.
///
/// **Lock hierarchy level**: FUTEX_BUCKET (level 0). This is BELOW all scheduler
/// locks so that futex_wake can safely call scheduler::enqueue() while holding
/// a bucket lock. The authoritative scheduler lock ordering from Section 10 is:
/// TASK_LOCK (level 1) < RQ_LOCK (level 2) < PI_LOCK (level 3).
/// Futex bucket locks are at level 0, allowing the following valid acquisition:
/// 1. Acquire FUTEX_BUCKET (level 0)
/// 2. Wake waiter, call scheduler::enqueue()
/// 3. Scheduler acquires TASK_LOCK (level 1) — valid: 0 < 1
/// 4. Release FUTEX_BUCKET
pub struct FutexHashTable {
buckets: Box<[SpinLock<IntrusiveList<FutexWaiter>, FUTEX_BUCKET>]>,
}
/// Lock level for futex bucket locks. Below TASK_LOCK (level 1) to allow
/// futex_wake → scheduler::enqueue() without lock ordering violation.
pub const FUTEX_BUCKET: LockLevel = LockLevel(0);
/// Futex hash table sizing. Scaled at boot based on system memory:
/// - ≤1 GB: 256 buckets
/// - ≤16 GB: 1024 buckets
/// - ≤256 GB: 4096 buckets
/// - >256 GB: 16384 buckets
/// This matches Linux's scaling heuristic (futex_init in kernel/futex/core.c).
pub const fn futex_hash_size(memory_bytes: usize) -> usize {
match memory_bytes {
0..=0x4000_0000 => 256,                   // ≤1 GB
0x4000_0001..=0x4_0000_0000 => 1024,      // ≤16 GB
0x4_0000_0001..=0x40_0000_0000 => 4096,   // ≤256 GB
_ => 16384, // >256 GB
}
}
The hash table size is determined at boot based on available memory (see futex_hash_size
above). FUTEX_WAIT atomically checks *uaddr == val while holding the bucket lock,
closing the race window between the userspace check and the kernel enqueue.
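This check-under-lock pattern can be sketched in miniature with std primitives; `Bucket` and `futex_wait_check` are illustrative stand-ins (a std `Mutex` models the kernel bucket spinlock, a `u64` task id models the intrusive waiter node), not ISLE's real types:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

/// One hash bucket: the spinlock-protected waiter list (std Mutex stands in
/// for the kernel spinlock in this userspace sketch).
pub struct Bucket {
    pub waiters: Mutex<VecDeque<u64 /* task id */>>,
}

/// FUTEX_WAIT core: re-check `*uaddr == expected` *while holding the bucket
/// lock*. If userspace changed the word between its own check and this point,
/// return EAGAIN instead of enqueueing -- this closes the lost-wakeup window.
pub fn futex_wait_check(
    bucket: &Bucket,
    uaddr: &AtomicU32,
    expected: u32,
    task: u64,
) -> Result<(), i32> {
    const EAGAIN: i32 = 11;
    let mut q = bucket.waiters.lock().unwrap();
    if uaddr.load(Ordering::SeqCst) != expected {
        return Err(EAGAIN); // value already changed; do not sleep
    }
    q.push_back(task); // enqueue atomically w.r.t. any FUTEX_WAKE
    Ok(())
}
```

Any FUTEX_WAKE must take the same bucket lock before scanning the waiter list, so a wake can never slip between the value check and the enqueue.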
20a.2 Priority-Inheritance Futexes (PI)
Linux problem: Priority inversion occurs when a high-priority RT task blocks on a mutex held by a low-priority task, while a medium-priority task preempts the lock holder indefinitely. Without intervention, the RT task's latency becomes unbounded.
ISLE design: FUTEX_LOCK_PI and FUTEX_UNLOCK_PI implement kernel-mediated priority inheritance. When an RT task (priority 99) blocks on a PI futex held by a normal task (nice 0), the kernel temporarily boosts the lock holder to priority 99 so it can complete its critical section without being preempted by medium-priority work.
PI chain tracking handles transitive dependencies: if task A (priority 99) waits on a lock held by B (priority 50), and B waits on a lock held by C (priority 10), the kernel walks the chain and boosts C to priority 99. The chain walk is bounded by a compile-time limit (default: 1024 entries) to prevent runaway traversal.
Deadlock detection falls out naturally: if the chain walk encounters the requesting task
again (A waits on B waits on A), the kernel returns EDEADLK immediately rather than
creating a circular dependency.
PI boosting integrates with all three scheduler classes (Section 14): an EEVDF task can be temporarily boosted into the RT class, and a Deadline task's runtime budget is respected even when boosted. When the lock holder releases the PI futex, its effective priority reverts to the highest priority among any remaining PI dependencies (or its base priority if none remain).
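The bounded chain walk with transitive boosting and cycle detection can be sketched as follows; `blocked_on` and `pi_chain_boost` are hypothetical stand-ins, with a flat map replacing the kernel's real per-futex owner tracking:

```rust
use std::collections::HashMap;

const EDEADLK: i32 = 35;
const MAX_PI_CHAIN: usize = 1024; // compile-time walk limit from the text

/// `blocked_on[task]` = the owner of the lock that `task` is waiting for.
/// Walks the chain starting at `waiter`, boosting every owner to at least
/// `waiter_prio`. Returns the boosted tasks, or EDEADLK if the chain cycles
/// back to `waiter` (A waits on B waits on A).
pub fn pi_chain_boost(
    blocked_on: &HashMap<u64, u64>,
    prio: &mut HashMap<u64, u8>,
    waiter: u64,
    waiter_prio: u8,
) -> Result<Vec<u64>, i32> {
    let mut boosted = Vec::new();
    let mut cur = waiter;
    for _ in 0..MAX_PI_CHAIN {
        let Some(&owner) = blocked_on.get(&cur) else { return Ok(boosted) };
        if owner == waiter {
            return Err(EDEADLK); // circular dependency detected
        }
        let p = prio.entry(owner).or_insert(0);
        if *p < waiter_prio {
            *p = waiter_prio; // transitive boost down the chain
            boosted.push(owner);
        }
        cur = owner;
    }
    Ok(boosted) // bounded: stop silently past the compile-time limit
}
```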
20a.3 Robust Futexes
Linux problem: If a thread crashes or is killed while holding a futex-based mutex, every other thread waiting on that futex blocks forever. The kernel has no way to know the dead thread held the lock because, in the normal case, the kernel never sees the lock/unlock at all (it is purely userspace).
ISLE design (same mechanism as Linux): Each thread maintains a userspace linked list
of currently held robust futex locks. The head of this list is registered with the kernel
via set_robust_list(). On thread exit (voluntary or involuntary), the kernel walks the
robust list and for each entry:
- Sets the FUTEX_OWNER_DIED bit (bit 30) in the futex word.
- Performs a FUTEX_WAKE on that address, waking one waiter.
- The woken thread sees FUTEX_OWNER_DIED, knows the lock state may be inconsistent, and can run recovery logic (or simply re-acquire the lock, clearing the bit).
The robust list walk is bounded (default: 2048 entries) to prevent a malicious thread from pointing the kernel at an enormous or circular list.
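A sketch of the bounded walk, assuming the robust-list entries have already been safely copied in (the real code must validate each user pointer before touching it):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const FUTEX_OWNER_DIED: u32 = 1 << 30;
const ROBUST_LIST_LIMIT: usize = 2048; // walk bound from the text

/// On thread death, walk the robust list of futex words. For each word still
/// owned by the dead thread, set FUTEX_OWNER_DIED and (in the kernel) wake one
/// waiter. The walk is bounded so a circular or oversized list cannot wedge
/// the kernel. Returns the number of entries handled.
pub fn walk_robust_list(words: &[&AtomicU32], dead_tid: u32) -> usize {
    let mut handled = 0;
    for w in words.iter().take(ROBUST_LIST_LIMIT) {
        // The owner TID lives in the low 30 bits of the futex word
        // (Linux robust-futex convention).
        let val = w.load(Ordering::SeqCst);
        if val & 0x3FFF_FFFF == dead_tid {
            w.store(val | FUTEX_OWNER_DIED, Ordering::SeqCst);
            handled += 1; // kernel would FUTEX_WAKE one waiter here
        }
    }
    handled
}
```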
20a.4 futex2 (FUTEX_WAITV)
Linux problem: The original futex(2) can only wait on a single address at a time.
Waiting on multiple synchronization objects simultaneously required workarounds like
polling threads or epoll-over-eventfd bridges -- all of which added latency and
complexity.
ISLE design: The futex_waitv() syscall (Linux 5.16+) is supported from day one
rather than retrofitted. It accepts an array of (uaddr, val, flags) tuples and blocks
until any one of them is triggered:
/// Matches Linux's `struct futex_waitv` (include/uapi/linux/futex.h).
/// The `uaddr` field is a u64 (not a pointer) to match the Linux ABI exactly —
/// this allows 32-bit processes on 64-bit kernels to pass 32-bit addresses
/// without sign-extension issues. The kernel validates the address and
/// interprets it as a `*const AtomicU32` internally.
#[repr(C)] // field order and layout must match the Linux ABI exactly
pub struct FutexWaitv {
pub val: u64,
pub uaddr: u64, // User virtual address (validated by kernel)
pub flags: u32, // FUTEX_32, FUTEX_PRIVATE_FLAG, etc.
pub __reserved: u32, // Must be zero (Linux ABI compatibility)
}
/// Block until any of the N futex addresses is woken or has a value mismatch.
/// Returns the index of the triggered futex, or -ETIMEDOUT, or -ERESTARTSYS.
pub fn sys_futex_waitv(
waiters: &[FutexWaitv],
flags: u32,
timeout: Option<&Timespec>,
clockid: ClockId,
) -> Result<usize, Errno> { ... }
Primary consumers:
- Wine/Proton: Windows WaitForMultipleObjects maps directly to futex_waitv,
enabling efficient game synchronization without per-object polling threads.
- Event-driven runtimes: Any pattern where a thread must wait on several
independent conditions (e.g., "data ready OR shutdown requested OR timeout").
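The kernel-side pre-sleep pass can be sketched like this; `WaitvEntry` and `waitv_precheck` are illustrative names (the real path re-checks each word under its bucket lock, as in 20a.1), and the 128-entry cap is Linux's FUTEX_WAITV_MAX:

```rust
/// One entry of a futex_waitv() call after copy-in, with the current value of
/// the user word already loaded (illustrative type, not the real kernel one).
pub struct WaitvEntry {
    pub current: u32,  // value read from the validated user address
    pub expected: u32, // `val` field supplied by userspace
}

/// Linux caps a futex_waitv() call at 128 entries (FUTEX_WAITV_MAX).
const FUTEX_WAITV_MAX: usize = 128;
const EINVAL: i32 = 22;

/// Pre-sleep pass: if any word already differs from its expected value,
/// return that entry's index immediately instead of sleeping. Only when every
/// word still matches does the caller enqueue on all N wait queues and block.
pub fn waitv_precheck(entries: &[WaitvEntry]) -> Result<Option<usize>, i32> {
    if entries.is_empty() || entries.len() > FUTEX_WAITV_MAX {
        return Err(EINVAL);
    }
    Ok(entries.iter().position(|e| e.current != e.expected))
}
```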
20a.5 Cross-Domain Futex Considerations
Standard futex implementations assume a single kernel address space. ISLE's isolation domains (MPK on x86-64, POE on AArch64, DACR on ARMv7, page-table isolation on RISC-V 64, PPC32, and PPC64LE) introduce a cross-domain shared-memory scenario that does not exist in Linux.
Shared-memory futex keying: When two processes (or a process and a Tier 1 driver)
share memory via MAP_SHARED, the futex key must be the physical address (page frame +
offset), not the virtual address, because each domain may map the region at a different
virtual address. The FutexKeyKind::Shared variant (see 20a.1) handles this case. Both
sides of the mapping hash to the same wait queue bucket, so FUTEX_WAKE from one domain
correctly wakes a waiter in the other.
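The keying rule can be sketched as a small enum and constructor; `FutexKey` here is a simplified stand-in for the `FutexKeyKind` of 20a.1, and `translate` models the page-table walk that yields the backing physical frame:

```rust
/// Futex key: private keys hash the (address-space id, vaddr) pair; shared
/// keys hash the physical location so two mappings of the same page agree.
#[derive(PartialEq, Eq, Hash, Debug, Clone, Copy)]
pub enum FutexKey {
    Private { mm_id: u64, vaddr: u64 },
    Shared { frame: u64, offset: u32 }, // page frame number + offset in page
}

/// Derive the key for a futex word. `translate` stands in for the page-table
/// walk from virtual address to physical frame (hypothetical closure).
pub fn futex_key(
    shared: bool,
    mm_id: u64,
    vaddr: u64,
    translate: impl Fn(u64) -> u64,
) -> FutexKey {
    if shared {
        // Physical keying: both domains compute the same key regardless of
        // where each one mapped the region.
        FutexKey::Shared { frame: translate(vaddr), offset: (vaddr & 0xFFF) as u32 }
    } else {
        FutexKey::Private { mm_id, vaddr }
    }
}
```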
Capability validation: Before performing any futex operation on a shared mapping,
the kernel verifies that the calling domain holds a valid capability to the underlying
shared memory region. A FUTEX_WAIT or FUTEX_WAKE on an address the caller cannot
legitimately access returns EFAULT. This prevents a compromised domain from probing or
waking arbitrary futex wait queues in other domains.
MPK interaction (x86-64): The futex word must reside in a page whose PKEY is accessible to both participating domains. In practice, this means the shared memory region is assigned to PKEY 1 (shared read-only descriptors) or PKEY 14 (shared DMA buffer pool), as defined in Section 3. The kernel reads and modifies the futex word from PKEY 0 (ISLE Core), which always has full read/write access to all domains -- so the kernel-side atomic comparison and wake are never blocked by MPK permissions, even if the calling domain's PKRU restricts access to other keys.
| Architecture | Isolation mechanism | Futex cross-domain access method |
|---|---|---|
| x86-64 | MPK (PKEY 0-15) | Kernel operates as PKEY 0; shared region on PKEY 1 or 14 |
| AArch64 | POE | Kernel accesses futex word via privileged overlay permission |
| ARMv7 | DACR | Kernel sets domain manager mode for shared page access |
| RISC-V 64 | Page-table isolation | Kernel maps shared page into supervisor address space |
| PPC32 | Segment registers | Kernel maps shared segment with supervisor key access |
| PPC64LE | Radix PID / HPT | Kernel accesses futex word via hypervisor-privileged mapping |
20b. Netlink Event Compatibility
ISLE's native event system (§16b, isle-core) delivers events via capability-gated ring buffers. For compatibility with existing Linux tools that use netlink sockets, isle-compat provides translation layers for the following netlink protocol families:
| Netlink Family | Purpose | Key Consumers |
|---|---|---|
| NETLINK_KOBJECT_UEVENT | Device hotplug events | udev, systemd, mdev |
| NETLINK_ROUTE | Network interface and routing events | iproute2 (ip), NetworkManager, systemd-networkd |
| NETLINK_AUDIT | Security audit events | auditd, systemd-journald |
| NETLINK_CONNECTOR | Process events (fork, exec, exit) | systemd, process accounting |
| NETLINK_NETFILTER | Firewall logging and conntrack | iptables logging, conntrack-tools |
Architecture: Each netlink family is handled by a dedicated translator in isle-compat:
1. Process opens a netlink socket (socket(AF_NETLINK, SOCK_DGRAM, protocol)).
2. isle-compat intercepts the socket creation and bind(), registering the process with the appropriate ISLE event channel.
3. When the kernel posts a native ISLE event, the translator converts it to the Linux netlink message format and writes to the socket buffer.
4. Process reads netlink messages via recvmsg().
20b.1 NETLINK_KOBJECT_UEVENT (Device Events)
udev and systemd use this for device hotplug. Example translation:
ISLE Event:
event_type = UsbDeviceChanged
data.usb = { vid=0x1234, pid=0x5678, inserted=true }
Netlink message:
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-1
SUBSYSTEM=usb
DEVTYPE=usb_device
PRODUCT=1234/5678/100
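A sketch of this translation, with assumed field names (`vid`, `pid`, `bcd_device`, `devpath`); the real isle-compat event types are defined elsewhere:

```rust
/// Minimal stand-in for a native ISLE USB hotplug event. Field names are
/// assumptions based on the example above, not the real isle-compat types.
pub struct UsbEvent {
    pub vid: u16,
    pub pid: u16,
    pub bcd_device: u16, // device release number (bcdDevice)
    pub inserted: bool,
    pub devpath: String,
}

/// Render the event in the uevent key=value format udev expects.
/// PRODUCT uses lowercase hex without zero padding, matching the example.
pub fn to_uevent(e: &UsbEvent) -> String {
    format!(
        "ACTION={}\nDEVPATH={}\nSUBSYSTEM=usb\nDEVTYPE=usb_device\nPRODUCT={:x}/{:x}/{:x}",
        if e.inserted { "add" } else { "remove" },
        e.devpath,
        e.vid,
        e.pid,
        e.bcd_device,
    )
}
```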
20b.2 NETLINK_ROUTE (Network Events)
NetworkManager, iproute2, and systemd-networkd use this for link state and address changes. The Tier 1 network stack (Section 34) posts native events that isle-compat translates:
- RTM_NEWLINK/RTM_DELLINK: Interface added/removed
- RTM_NEWADDR/RTM_DELADDR: IP address added/removed
- RTM_NEWROUTE/RTM_DELROUTE: Routing table changes
- RTM_NEWNEIGH/RTM_DELNEIGH: ARP/NDP neighbor cache updates
20b.3 Other Netlink Families
- NETLINK_AUDIT: Translated from ISLE's audit events (Section 24 IMA) for auditd.
- NETLINK_CONNECTOR: Translated from process lifecycle events (Section 12a) for cn_proc.
- NETLINK_NETFILTER: Translated from nftables/conntrack events (Section 20.4) for firewall logging.
20c. Windows Emulation Acceleration (WEA)
Wine and Proton emulate Windows NT kernel behavior in userspace. This subsystem provides kernel-level NT-compatible primitives that Wine/Proton can use directly, bypassing userspace emulation and achieving better correctness and performance.
Key insight: ISLE doesn't need to implement Windows syscalls directly. Instead, provide kernel-level primitives that make WINE/Proton faster, more correct, and easier to maintain.
Problem: WINE (and Proton) must emulate Windows NT kernel behavior in userspace on top of POSIX/Linux syscalls. This creates:
- Performance overhead: Multiple syscalls to emulate one Windows operation
- Semantic mismatches: Linux primitives don't map 1:1 to Windows primitives
- Correctness issues: WINE's userspace emulation can't perfectly replicate kernel-level Windows behavior
- Complexity: WINE's ntdll.dll is ~50K lines of Windows kernel emulation code
ISLE's opportunity: Provide a Windows NT-compatible object model as a kernel subsystem that WINE can use directly, bypassing userspace emulation.
Capability Gating
WEA syscalls (operation codes 0x0800-0x08FF) require CAP_WEA capability. This capability:
- Is NOT granted by default — only processes that explicitly request WEA support receive it.
- Can be scoped to a specific NT namespace subtree (e.g., CAP_WEA(namespace=/WINE-prefix-1)).
- Container isolation: each container (or WINE prefix) has its own \BaseNamedObjects\ subtree.
A process with CAP_WEA(namespace=/containers/abc) cannot access objects in /containers/def.
Without CAP_WEA, WEA syscalls return -EPERM. This prevents non-WINE processes from
interacting with the NT object namespace and ensures WEA's attack surface is opt-in.
20c.1 NT Object Manager
Windows NT kernel concept: Everything is an object (files, processes, threads, events, mutexes, semaphores, sections). Objects live in a hierarchical namespace (\Device\, \Driver\, \BaseNamedObjects\, etc.).
Current WINE approach: Emulates NT objects in userspace. Server process (wineserver) manages object lifetimes, handles, waits. High overhead for cross-process object sharing.
ISLE WEA approach: Kernel-native NT object manager alongside POSIX VFS.
/// NT Object Manager (lives in isle-compat crate)
pub struct NtObjectManager {
/// Hierarchical namespace (e.g., \BaseNamedObjects\MyEvent).
/// Uses RwLock for concurrent read access during lookups.
///
/// **Lock hierarchy**: WEA locks are in a separate "leaf" category that does not
/// call scheduler code while held. The NT namespace and object locks may call
/// allocator or capability code but NOT scheduler::enqueue(). This means:
/// - NT_NAMESPACE and NT_OBJECT locks do NOT need to be ordered relative to
/// scheduler locks (TASK_LOCK, RQ_LOCK, PI_LOCK).
/// - They DO need ordering relative to each other: NT_NAMESPACE < NT_OBJECT.
/// - They use a separate lock category (WEA_LOCKS) that is incompatible with
/// scheduler locks — holding any WEA lock while holding any scheduler lock
/// (or vice versa) is a compile-time error.
///
/// Wait operations (WaitForSingleObject, WaitForMultipleObjects) release all
/// NT object locks before calling scheduler::sleep(). Wake operations
/// (SetEvent, ReleaseMutex) mark the waiter as ready, then release NT object
/// locks, then call scheduler::wake() WITHOUT holding NT locks.
///
/// This "release-before-schedule" pattern is identical to how futex_wake works.
root: RwLock<NtDirectory, WEA_LOCK_CATEGORY, NT_NAMESPACE_LEVEL>,
/// Per-process NT handle tables (lazily allocated on first WEA syscall to
/// avoid ~1.5 MB overhead for non-WEA processes)
handle_tables: PerProcess<Option<Box<NtHandleTable>>>,
}
/// Lock category for WEA subsystem locks. Separate from scheduler locks.
/// Holding a WEA lock and a scheduler lock simultaneously is forbidden.
pub const WEA_LOCK_CATEGORY: LockCategory = LockCategory::WEA;
/// Lock level within WEA category for namespace directory lock.
pub const NT_NAMESPACE_LEVEL: u8 = 0;
/// Lock level within WEA category for individual NT object internal locks.
pub const NT_OBJECT_LEVEL: u8 = 1;
/// Directory entry in the NT namespace.
pub struct NtDirectoryEntry {
/// The object (Event, Mutex, etc.)
object: Arc<NtObject>,
/// Security descriptor controlling access (simplified from full Windows SD)
security: NtSecurityDescriptor,
/// Creation timestamp for audit/debugging
created_at: Instant,
}
/// Simplified NT security descriptor. Full Windows SDs are complex; we implement
/// the subset needed for WINE/Proton compatibility.
pub struct NtSecurityDescriptor {
/// Owner (maps to Unix UID via ISLE's capability system)
owner: UserId,
/// Container ID for namespace isolation (prevents cross-container squatting)
container_id: Option<ContainerId>,
}
/// Named object creation with atomic create-or-open semantics.
/// Prevents TOCTOU race conditions in named object access.
impl NtObjectManager {
/// Create a named object atomically. Returns existing object if name exists
/// and `open_existing` is true; returns STATUS_OBJECT_NAME_COLLISION if name
/// exists and `open_existing` is false. This is a single atomic operation
/// under the directory write lock — no TOCTOU window.
///
/// **Scalability note**: The global RwLock on `root` serializes named object
/// creation but NOT lookups (which take a read lock) or operations on existing
/// objects (which don't touch the namespace lock). For gaming workloads where
/// objects are created once at startup and then operated on, this is acceptable.
/// If profiling shows namespace lock contention, the design can evolve to:
/// 1. Per-directory locks (each NtDirectory has its own RwLock)
/// 2. Concurrent hash map (lock-free lookup, fine-grained insertion lock)
/// The current design prioritizes simplicity and correctness over scalability.
pub fn create_named<T: NtObjectType>(
&self,
path: &NtPath,
open_existing: bool,
access: u32,
security: NtSecurityDescriptor,
) -> Result<(NtHandle, bool /* created */), NtStatus> {
let mut dir = self.root.write();
if let Some(existing) = dir.lookup(path) {
// Check caller has permission to access existing object
self.check_access(existing, access)?;
// Check container isolation: object must be in same container or global
self.check_container_access(existing, &security)?;
if open_existing {
return Ok((self.create_handle(existing, access), false));
} else {
return Err(STATUS_OBJECT_NAME_COLLISION);
}
}
// Create new object under write lock — atomic with the lookup
let obj = Arc::new(T::create()?);
let entry = NtDirectoryEntry {
object: Arc::clone(&obj),
security,
created_at: Instant::now(),
};
dir.insert(path, entry);
Ok((self.create_handle(&obj, access), true))
}
}
pub enum NtObject {
Event(NtEvent),
Mutex(NtMutex),
Semaphore(NtSemaphore),
Section(NtSection), // Memory-mapped file or shared memory
Process(NtProcess),
Thread(NtThread),
Timer(NtTimer),
IoCompletionPort(NtIocp),
Job(NtJob),
}
pub struct NtHandleTable {
/// Handles are indices into this table, not file descriptors.
/// Heap-allocated fixed-capacity array with maximum 65536 entries (matching
/// ISLE's CapSpace limit and Linux's RLIMIT_NOFILE default). Attempting to
/// create handles beyond this limit returns STATUS_INSUFFICIENT_RESOURCES.
/// Box<[T; N]> is used to avoid stack allocation of ~1MB.
entries: Box<[Option<NtHandleEntry>; NT_MAX_HANDLES]>,
/// Bitmap tracking which slots are free, for O(1) allocation.
/// Also heap-allocated to avoid stack pressure.
free_bitmap: Box<[AtomicU64; NT_MAX_HANDLES / 64]>,
/// Windows handles are user-mode values that are nonzero multiples of 4.
/// We maintain that illusion by encoding handle = (index + 1) << 2, so
/// index 0 maps to handle 4 and no two indices collide.
next_hint: AtomicU32, // Hint for next free slot search, not authoritative
}
/// Maximum NT handles per process. Matches ISLE's CapSpace limit (Section 11).
/// Windows default is ~16 million but most applications use far fewer.
pub const NT_MAX_HANDLES: usize = 65536;
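Handle values must be nonzero multiples of 4 so that NULL (0) and INVALID_HANDLE_VALUE (-1) stay unrepresentable; one encoding that satisfies this, sketched here as standalone helpers, is (index + 1) << 2:

```rust
/// Encode a handle-table index as a Windows-style handle value: a nonzero
/// multiple of 4. Index 0 maps to handle 4, index 1 to 8, and so on.
pub fn encode_handle(index: u32) -> u32 {
    (index + 1) << 2
}

/// Decode back to a table index; rejects zero and misaligned values, which
/// can never have been produced by encode_handle.
pub fn decode_handle(handle: u32) -> Option<u32> {
    if handle == 0 || handle & 0x3 != 0 {
        return None;
    }
    Some((handle >> 2) - 1)
}
```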
pub struct NtHandleEntry {
object: Arc<NtObject>,
access_mask: u32, // Windows ACCESS_MASK
attributes: u32, // OBJ_INHERIT, OBJ_PERMANENT, etc.
}
Syscalls provided:
// These are ISLE syscalls, not Windows syscalls
// WINE's ntdll.dll calls these instead of emulating in userspace
SYS_nt_create_event(
name: *const u16, // UTF-16 name (Windows convention)
manual_reset: bool,
initial_state: bool,
) -> Result<NtHandle>;
SYS_nt_open_event(
name: *const u16,
access: u32,
) -> Result<NtHandle>;
SYS_nt_set_event(handle: NtHandle) -> Result<()>;
SYS_nt_reset_event(handle: NtHandle) -> Result<()>;
SYS_nt_pulse_event(handle: NtHandle) -> Result<()>;
SYS_nt_wait_for_single_object(
handle: NtHandle,
timeout_ns: Option<u64>, // Windows uses 100ns units, we convert
) -> Result<WaitResult>;
SYS_nt_wait_for_multiple_objects(
handles: &[NtHandle],
wait_all: bool, // WaitAll vs WaitAny
timeout_ns: Option<u64>,
) -> Result<WaitResult>;
SYS_nt_create_section(
name: Option<*const u16>,
size: u64,
protection: u32, // PAGE_READWRITE, PAGE_EXECUTE_READ, etc.
file: Option<Fd>, // Back with file or anonymous
) -> Result<NtHandle>;
SYS_nt_map_view_of_section(
section: NtHandle,
base_address: Option<*mut u8>, // NULL = kernel picks
size: u64,
offset: u64,
protection: u32,
) -> Result<*mut u8>;
Benefits for WINE:
1. Performance: Single syscall instead of 5-10 syscalls + wineserver RPC
2. Correctness: Kernel enforces Windows NT semantics exactly
3. Simplicity: WINE's ntdll.dll becomes thin wrapper over ISLE syscalls
4. Cross-process: Named objects work correctly between processes (games + launchers)
20c.2 Fast Synchronization Primitives
Problem: Windows has NtWaitForMultipleObjects (wait on up to 64 objects simultaneously). Linux has no equivalent — WINE emulates with pipes + poll() or wineserver signaling. High overhead.
ISLE WEA approach: Kernel-native multi-object wait.
/// Result of waiting on NT synchronization objects.
/// Windows limits WaitForMultipleObjects to 64 handles (MAXIMUM_WAIT_OBJECTS).
/// This limit is enforced at runtime, not in the type system.
pub enum WaitResult {
/// One of the waited objects became signaled. The inner value is the
/// zero-based index of the signaled handle in the input array.
/// For WaitAll, this is 0 (all signaled, return indicates the first).
Signaled(usize),
/// Wait timed out before any object was signaled.
Timeout,
/// A mutex was abandoned (owner thread died while holding it).
/// The inner value is the index of the abandoned mutex.
/// Windows semantics: the waiter acquires the mutex but should check state.
Abandoned(usize),
/// An I/O completion port had a packet available (for alertable waits).
IoCompletion,
}
impl NtObjectManager {
/// Wait on multiple objects (events, mutexes, semaphores, threads, processes)
/// Returns when ANY object becomes signaled (WaitAny) or ALL (WaitAll)
pub fn wait_for_multiple_objects(
handles: &[NtHandle],
wait_all: bool,
timeout: Option<Duration>,
) -> Result<WaitResult> {
// --- WaitAny semantics ---
// Register on wait queues for all handles. When ANY object signals,
// the thread is woken. On wakeup, atomically consume the signaled
// object (reset auto-reset event, acquire mutex, decrement semaphore).
// Deregister from all wait queues before returning.
// --- WaitAll atomicity ---
// WaitAll requires atomic multi-acquire: either ALL objects are acquired
// in a single atomic operation, or NONE are. Implementation:
//
// 1. Sort handles by object address to establish lock ordering.
// 2. Acquire each object's lock in sorted order (prevents deadlock).
// 3. Check if ALL objects are signaled:
// - Event: signaled == true
// - Mutex: owner == None OR owner == current_thread (recursive)
// - Semaphore: count > 0
// - Process/Thread: terminated
// 4. If ALL signaled, atomically consume ALL (reset events, acquire
// mutexes, decrement semaphores) while still holding all locks.
// 5. Release all locks in reverse order.
// 6. If NOT all signaled, release all locks and block on wait queues
// (same as WaitAny). Retry step 1-5 on each wakeup.
//
// This two-phase locking ensures no partial acquisition: either the
// calling thread wins all objects, or it wins none and blocks.
//
// Lock ordering: Objects are sorted by their kernel address. This
// matches Windows NT's implementation and prevents deadlock when
// multiple threads WaitAll on overlapping handle sets.
}
}
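A miniature of this all-or-nothing protocol using std primitives; `Obj` and `try_wait_all` are illustrative stand-ins in which a counter models signaled state (count > 0) and pointer-order locking models the sort-by-kernel-address step:

```rust
use std::sync::Mutex;

/// Toy signalable object: a counting semaphore (count > 0 == signaled).
pub struct Obj(pub Mutex<u32>);

/// WaitAll attempt: lock every object in a canonical (address) order, check
/// that all are signaled, and only then consume all of them. Returns true if
/// everything was acquired, false (and nothing consumed) otherwise -- the
/// caller would then block on the wait queues and retry.
pub fn try_wait_all(objs: &[&Obj]) -> bool {
    let mut order: Vec<usize> = (0..objs.len()).collect();
    // Deadlock-free lock ordering: sort by object address, as in the text.
    order.sort_by_key(|&i| objs[i] as *const Obj as usize);
    let guards: Vec<_> = order.iter().map(|&i| objs[i].0.lock().unwrap()).collect();
    if guards.iter().all(|g| **g > 0) {
        let mut guards = guards;
        for g in guards.iter_mut() {
            **g -= 1; // consume while still holding every lock
        }
        true
    } else {
        false // guards drop here, releasing all locks with nothing consumed
    }
}
```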
Why this matters for gaming:
- Game engines (Unreal, Unity) use multi-object waits heavily
- DirectX 11/12 synchronization uses events and mutexes
- Projected 5-10x performance improvement over WINE's current userspace emulation (see 20c.8)
20c.3 I/O Completion Ports (IOCP)
Problem: Windows IOCP is a high-performance async I/O primitive used by game servers, engines. Linux has io_uring but semantics don't match. WINE emulates IOCP poorly.
ISLE WEA approach: Kernel-native IOCP implementation.
/// Maximum pending completion packets per IOCP. Prevents unbounded kernel
/// memory growth from userspace posting. Windows doesn't document a hard
/// limit; we use 64K which exceeds any practical game workload.
pub const NT_MAX_IOCP_PACKETS: usize = 65536;
pub struct NtIocp {
/// Completion queue (MPMC: many threads post via I/O completion or
/// PostQueuedCompletionStatus, multiple worker threads consume via
/// GetQueuedCompletionStatus). The `concurrency` field limits how many
/// threads can dequeue simultaneously. Bounded to NT_MAX_IOCP_PACKETS;
/// posting to a full queue returns STATUS_INSUFFICIENT_RESOURCES.
completion_queue: BoundedMpmcQueue<IocpPacket, NT_MAX_IOCP_PACKETS>,
/// Associated threads (NT allows binding threads to IOCP)
concurrency: usize, // Max threads that can dequeue simultaneously
/// Wait queue for GetQueuedCompletionStatus
wait_queue: WaitQueue,
}
pub struct IocpPacket {
bytes_transferred: u32,
completion_key: usize, // User-defined per-handle key
/// User-provided OVERLAPPED pointer. This is an **opaque token** that the kernel
/// never dereferences — it is stored on PostQueuedCompletionStatus and returned
/// unchanged on GetQueuedCompletionStatus. The caller is responsible for ensuring
/// the pointer remains valid until dequeued. The kernel treats this as a usize
/// (not a validated UserPtr) because it is purely userspace-to-userspace data flow.
overlapped: usize, // Opaque user pointer (NOT dereferenced by kernel)
status: i32, // NT status code
}
// Syscalls
SYS_nt_create_iocp(concurrency: usize) -> Result<NtHandle>;
SYS_nt_associate_file_with_iocp(
file: Fd,
iocp: NtHandle,
completion_key: usize,
) -> Result<()>;
SYS_nt_post_queued_completion_status(
iocp: NtHandle,
packet: IocpPacket,
) -> Result<()>;
SYS_nt_get_queued_completion_status(
iocp: NtHandle,
timeout: Option<Duration>,
) -> Result<IocpPacket>;
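A userspace sketch of the bounded-queue semantics; a `Mutex` plus `Condvar` stands in for the kernel's MPMC queue and wait queue, and the real GetQueuedCompletionStatus blocks on the wait queue rather than returning None:

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// NT status code for a full queue (0xC000009A as a signed i32).
const STATUS_INSUFFICIENT_RESOURCES: i32 = -1073741670;

/// Bounded completion queue backing Post/GetQueuedCompletionStatus.
/// Packets are (bytes_transferred, completion_key, overlapped) triples.
pub struct Iocp {
    queue: Mutex<VecDeque<(u32, usize, usize)>>,
    ready: Condvar,
    capacity: usize,
}

impl Iocp {
    pub fn new(capacity: usize) -> Self {
        Iocp { queue: Mutex::new(VecDeque::new()), ready: Condvar::new(), capacity }
    }

    /// Post never blocks: a full queue is an error, matching the bounded
    /// design (NT_MAX_IOCP_PACKETS) described above.
    pub fn post(&self, pkt: (u32, usize, usize)) -> Result<(), i32> {
        let mut q = self.queue.lock().unwrap();
        if q.len() >= self.capacity {
            return Err(STATUS_INSUFFICIENT_RESOURCES);
        }
        q.push_back(pkt);
        self.ready.notify_one(); // wake one GetQueuedCompletionStatus waiter
        Ok(())
    }

    /// Non-blocking dequeue in FIFO order (the real syscall would sleep).
    pub fn try_get(&self) -> Option<(u32, usize, usize)> {
        self.queue.lock().unwrap().pop_front()
    }
}
```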
Why this matters:
- Multiplayer game servers (Rust game servers, Minecraft servers under Wine)
- Game engines with async asset loading
- Network code in games (sockets + IOCP)
Implementation note: ISLE's existing async I/O (Section 8a, ring buffers and IPC channels) can back this. IOCP is a userspace-visible queue over kernel async I/O.
20c.4 Memory Management Acceleration
Problem: Windows VirtualAlloc, VirtualFree, VirtualProtect have specific semantics that don't map cleanly to mmap/munmap/mprotect:
- Reservation vs commit: Reserve address space without allocating pages, commit later
- MEM_RESET: Discard pages but keep address range mapped (Linux has MADV_DONTNEED but semantics differ)
- Guard pages: PAGE_GUARD causes exception on first access, then becomes normal page
- Large pages: MEM_LARGE_PAGES (2MB/1GB pages)
ISLE WEA approach: Extended mmap with Windows-compatible flags.
// Extend existing ISLE mmap syscall with WEA flags
SYS_mmap_wea(
addr: Option<*mut u8>,
size: usize,
protection: u32, // PAGE_READWRITE | PAGE_EXECUTE_READ | ...
flags: u32, // MEM_RESERVE, MEM_COMMIT, MEM_RESET, MEM_LARGE_PAGES
fd: Option<Fd>,
) -> Result<*mut u8>;
// New syscalls for Windows-specific ops
SYS_virtual_protect(
addr: *mut u8,
size: usize,
new_protection: u32,
old_protection: &mut u32, // Windows returns old protection
) -> Result<()>;
SYS_virtual_lock(
addr: *mut u8,
size: usize,
) -> Result<()>; // Pin pages in RAM (VirtualLock)
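The reserve/commit distinction can be sketched as a per-page state machine; `Vm` and the page-index addressing are illustrative, not the real VMA code:

```rust
use std::collections::HashMap;

/// A claimed page is either reserved (address space only, no backing) or
/// committed (backed by memory). Absence from the map means free.
#[derive(Clone, Copy)]
pub enum PageState {
    Reserved,
    Committed,
}

const ERR: i32 = -1;

/// Toy address-space tracker; page indices stand in for addresses.
pub struct Vm {
    pages: HashMap<usize, PageState>,
}

impl Vm {
    pub fn new() -> Self {
        Vm { pages: HashMap::new() }
    }

    /// MEM_RESERVE: claim address space; fails if the page is already claimed.
    pub fn reserve(&mut self, page: usize) -> Result<(), i32> {
        if self.pages.contains_key(&page) {
            return Err(ERR);
        }
        self.pages.insert(page, PageState::Reserved);
        Ok(())
    }

    /// MEM_COMMIT: back a reserved page. Committing a committed page is a
    /// no-op; committing unreserved address space is an error, which is the
    /// Windows behavior mmap alone cannot express.
    pub fn commit(&mut self, page: usize) -> Result<(), i32> {
        match self.pages.get(&page).copied() {
            Some(PageState::Reserved) | Some(PageState::Committed) => {
                self.pages.insert(page, PageState::Committed);
                Ok(())
            }
            None => Err(ERR),
        }
    }
}
```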
Why this matters:
- Games use VirtualAlloc for custom allocators
- JIT compilers (C#/CLR games) use executable memory allocation
- DX12 resource heaps use large page allocations
20c.5 NT Thread Model and Fiber Support
Problem: Windows threads have TEB (Thread Environment Block), fiber contexts (cooperative coroutines), FLS (Fiber Local Storage), and APC (Asynchronous Procedure Call) queues. WINE emulates most of this in userspace; the gaps are performance and correctness of blocking-in-fiber.
ISLE WEA approach: Extend the ISLE thread model with NT-compatible TLS and APC support. Fiber support leverages the native ISLE scheduler upcall mechanism (§12a.7) for correct blocking behavior.
pub struct NtThread {
/// Standard ISLE thread.
isle_thread: Arc<Task>,
/// Thread Environment Block — allocated in user address space.
/// Kernel records the address for fast NtCurrentTeb() via GS base.
teb_address: *mut NtTeb,
/// APC queue (kernel-mode and user-mode APCs). Uses intrusive linked list
/// to avoid heap allocation under spinlock. Apc nodes are allocated from
/// a pre-allocated per-thread pool (max 64 pending APCs per thread).
apc_queue: SpinLock<IntrusiveList<Apc>>,
/// Pre-allocated APC node pool. Avoids allocator calls under spinlock.
apc_pool: [MaybeUninit<ApcNode>; NT_MAX_PENDING_APCS],
apc_pool_bitmap: AtomicU64, // 64 slots, 1 bit each
}
/// Maximum pending APCs per thread. Windows doesn't document a hard limit,
/// but practical applications rarely exceed a handful.
pub const NT_MAX_PENDING_APCS: usize = 64;
#[repr(C)]
pub struct NtTeb {
/// NtTib.Self: self-pointer (always TEB[0], offset 0x00 on x64).
self_ptr: *mut NtTeb,
/// NtTib.StackBase / StackLimit: valid stack range for current fiber.
/// Updated by WINE's SwitchToFiber() — userspace write, no syscall.
stack_base: *mut u8,
stack_limit: *mut u8,
/// NtTib.FiberData: pointer to the active fiber's data block.
/// Updated by WINE on every SwitchToFiber() — userspace write.
fiber_data: *mut u8,
// Kernel maintains these fields at thread creation time.
// WINE manages the full TEB layout; kernel only guarantees:
// - TEB is allocated and zeroed to at least 0x1000 bytes (Windows x64 minimum)
// - GS base points to TEB (x64) or FS base (x86 WoW64)
// - self_ptr is initialized to TEB address
// - stack_base/stack_limit are set from thread stack
// WINE is responsible for populating remaining fields (PEB pointer at 0x60,
// LastErrorValue at 0x68, TLS array at 0x58, etc.) before first user-mode entry.
}
pub struct Apc {
routine: extern "C" fn(*mut u8),
context: *mut u8,
mode: ApcMode, // KernelMode vs UserMode
}
// WEA syscalls for APC support.
// SYS_nt_queue_apc returns STATUS_INSUFFICIENT_RESOURCES if the target thread's
// APC pool (64 entries) is exhausted. This is not a Windows-documented limit,
// but practical applications rarely exceed it. WINE can retry or log a warning.
SYS_nt_queue_apc(thread: NtHandle, routine: extern "C" fn(*mut u8), context: *mut u8) -> Result<()>;
SYS_nt_alert_thread(thread: NtHandle) -> Result<()>;
SYS_nt_test_alert() -> Result<bool>;
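The pool's bitmap allocator can be sketched directly; this is a generic lock-free 64-slot CAS loop, not the exact ISLE code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Allocate one slot from a 64-entry pool tracked by a single AtomicU64
/// bitmap (bit set = slot in use). Lock-free CAS loop; returns None when the
/// pool is exhausted, which the syscall maps to STATUS_INSUFFICIENT_RESOURCES.
pub fn alloc_slot(bitmap: &AtomicU64) -> Option<usize> {
    loop {
        let cur = bitmap.load(Ordering::Acquire);
        if cur == u64::MAX {
            return None; // all 64 slots taken
        }
        let slot = (!cur).trailing_zeros() as usize; // lowest free bit
        let new = cur | (1u64 << slot);
        if bitmap
            .compare_exchange(cur, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return Some(slot);
        }
        // CAS lost a race with another allocator: reload and retry.
    }
}

/// Release a slot back to the pool.
pub fn free_slot(bitmap: &AtomicU64, slot: usize) {
    bitmap.fetch_and(!(1u64 << slot), Ordering::AcqRel);
}
```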
Fiber kernel responsibilities — what requires kernel involvement and what does not:
| Win32 API | Kernel role | Implementation |
|---|---|---|
| ConvertThreadToFiber() | Allocate upcall stack, call SYS_register_scheduler_upcall | WINE calls §12a.7 registration |
| CreateFiber(size, fn, p) | None | WINE allocates stack, sets up UpcallFrame in userspace |
| SwitchToFiber(fiber) | None | WINE saves registers, swaps stack pointer, updates TEB.FiberData -- pure userspace |
| DeleteFiber(fiber) | None | WINE frees stack |
| FlsAlloc / FlsGetValue / FlsSetValue | None | WINE maintains per-fiber FLS table in user address space; pointer swapped on SwitchToFiber |
| Fiber calls blocking syscall | Scheduler upcall (§12a.7) | Kernel invokes upcall; WINE converts to io_uring, parks fiber, runs another |
Why blocking-in-fiber is the only hard problem: SwitchToFiber needs zero
kernel involvement — it is register save/restore. FLS is an array in user
memory. The problem is a fiber calling NtReadFile (→ read(2)) which would
block the OS thread, starving all other fibers. The §12a.7 scheduler upcall
mechanism solves this: WINE registers an upcall handler on the OS thread; when
any fiber's syscall would block, the kernel invokes the handler, which submits
the I/O to io_uring and runs the next fiber. The OS thread remains live.
This is exactly how Naughty Dog's fiber-based job system (and similar game-engine job schedulers) achieves high core utilization: fibers never "waste" a core waiting for I/O or synchronization.
Why this matters:
- Games using Windows fiber-based job systems (Destiny, various Unreal titles)
- Windows thread pool APIs (TpCallbackMayRunLong, TP_CALLBACK_ENVIRON)
- .NET/C# games (CLR uses APCs for garbage collection suspension)
- Anti-cheat systems that inspect TEB/fiber state
20c.6 Security & Token Model
Problem: Windows has security tokens (user SID, group SIDs, privileges). Many games/launchers check tokens. WINE fakes most of this.
ISLE WEA approach: Minimal NT token emulation (not full Windows security, just enough for compatibility).
/// Maximum groups per token. Windows allows up to 1024 groups; we use a lower
/// limit since WINE/Proton games typically need far fewer.
pub const NT_MAX_TOKEN_GROUPS: usize = 128;
/// Maximum privileges per token. Windows defines ~36 privileges; we cap at 64.
pub const NT_MAX_TOKEN_PRIVILEGES: usize = 64;
pub struct NtToken {
/// User SID (S-1-5-21-...)
user_sid: WinSid,
/// Groups (Administrators, Users, etc.). Fixed-capacity array to prevent
/// unbounded kernel memory growth from malicious token inflation.
groups: ArrayVec<WinSid, NT_MAX_TOKEN_GROUPS>,
/// Privileges (SeDebugPrivilege, SeBackupPrivilege, etc.)
/// Most are no-ops, but games check for them. Fixed-capacity bitset.
privileges: BitArray<[u64; 1]>, // 64 bits = 64 privilege slots
/// Integrity level (Low, Medium, High, System)
integrity_level: IntegrityLevel,
}
// Syscalls
SYS_nt_open_process_token(
process: NtHandle,
access: u32,
) -> Result<NtHandle>;
SYS_nt_query_token_information(
token: NtHandle,
class: TokenInformationClass,
buffer: *mut u8,
buffer_len: u32,
) -> Result<u32>; // Returns bytes written
Why this matters:
- Game launchers (Epic, Ubisoft) check admin privileges
- Anti-cheat checks process token integrity level
- Windows Store games check app container tokens
20c.7 Structured Exception Handling (SEH)
Problem: Windows uses SEH (Structured Exception Handling) for both C++ exceptions and hardware exceptions (access violations, divide-by-zero). x86-64 Windows uses table-based unwinding. WINE emulates via signal handlers.
ISLE WEA approach: Kernel-assisted SEH dispatch with safety bounds.
// When hardware exception occurs (page fault, illegal instruction, etc.):
// 1. Kernel looks up exception handler chain in TEB
// 2. Validates and calls user-mode exception handlers in order
// 3. If unhandled, terminates process (Windows behavior)
pub struct ExceptionRecord {
exception_code: u32, // STATUS_ACCESS_VIOLATION, etc.
exception_flags: u32,
exception_address: usize,
parameters: [usize; 15], // Exception-specific data
}
// When CPU exception occurs, kernel:
// 1. Saves context (registers, stack)
// 2. Reads TEB->ExceptionList (user address, validated)
// 3. For each handler in the chain (max SEH_MAX_CHAIN_DEPTH = 256):
// a. Validate handler address is in executable user pages
// b. Validate next pointer is in readable user pages or NULL
// c. Call handler via controlled user-mode return
// d. If handler returns EXCEPTION_EXECUTE_HANDLER, unwind to it
// 4. If chain exhausted or max depth reached, terminate process
/// Maximum SEH chain depth to traverse. Prevents infinite loops on
/// corrupted/circular exception lists. Windows doesn't document a limit
/// but practical applications rarely exceed 10-20 handlers.
pub const SEH_MAX_CHAIN_DEPTH: usize = 256;
// No new syscalls needed — kernel handles this automatically
// WINE just needs to set up TEB->ExceptionList correctly
//
// Safety invariants enforced by kernel:
// - Each handler address must be in VMA with PROT_EXEC
// - Each EXCEPTION_REGISTRATION_RECORD must be in readable user memory
// - Chain traversal stops at NULL, invalid pointer, or depth limit
// - Circular chains detected via depth limit (not pointer tracking)
Why this matters:
- Windows games compiled with MSVC use SEH
- Access violations (common in games with bugs) are handled differently than Linux segfaults
- Debuggers need to intercept first-chance exceptions
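The bounded chain walk described above can be sketched against an in-memory model. Here handler and next-pointer validation are reduced to index checks (the real kernel validates executable/readable user pages), and a record is reduced to "does this handler claim the exception?" plus a link to the next record:

```rust
/// Maximum SEH chain depth, matching the constant defined in this section.
pub const SEH_MAX_CHAIN_DEPTH: usize = 256;

/// Simplified EXCEPTION_REGISTRATION_RECORD: whether the handler claims the
/// exception, and the index of the next record (None = end of chain).
pub struct SehRecord {
    pub handles_exception: bool,
    pub next: Option<usize>,
}

/// Walk the chain starting at `head`. Returns Some(index) of the handler that
/// claimed the exception (EXCEPTION_EXECUTE_HANDLER), or None if the chain is
/// exhausted, a link is invalid, or the depth limit is hit. Circular chains
/// are caught by the depth limit, not by pointer tracking.
pub fn walk_seh_chain(records: &[SehRecord], head: usize) -> Option<usize> {
    let mut current = Some(head);
    for _ in 0..SEH_MAX_CHAIN_DEPTH {
        let idx = current?;                  // NULL link: chain exhausted
        let rec = records.get(idx)?;         // invalid pointer: give up
        if rec.handles_exception {
            return Some(idx);
        }
        current = rec.next;
    }
    None // depth limit reached: corrupted or circular chain, terminate
}

fn main() {
    // Well-formed chain: 0 -> 1 -> 2 (handler at index 2 claims it).
    let chain = vec![
        SehRecord { handles_exception: false, next: Some(1) },
        SehRecord { handles_exception: false, next: Some(2) },
        SehRecord { handles_exception: true,  next: None },
    ];
    assert_eq!(walk_seh_chain(&chain, 0), Some(2));

    // Circular chain: 0 -> 1 -> 0 -> ... terminates at the depth limit.
    let circular = vec![
        SehRecord { handles_exception: false, next: Some(1) },
        SehRecord { handles_exception: false, next: Some(0) },
    ];
    assert_eq!(walk_seh_chain(&circular, 0), None);
}
```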
20c.8 Performance: Projected Comparison
Note: These are design-phase projections, not measured benchmarks. WEA is not yet implemented. The estimates are based on syscall overhead analysis (measuring existing wineserver round-trip vs expected kernel object access latency) and comparable Linux kernel primitives (futex, epoll). Actual performance will be validated during implementation.
Projected workload: Unreal Engine 5 game loading (Proton on Linux vs WEA on ISLE)
| Operation | Linux + WINE (est.) | ISLE + WEA (projected) | Projected Speedup |
|---|---|---|---|
| CreateEvent (named) | ~15 μs (wineserver RPC) | ~1.5 μs (kernel object) | ~10x |
| WaitForMultipleObjects (8 handles) | ~8 μs (poll + wineserver) | ~0.5 μs (kernel wait) | ~16x |
| VirtualAlloc (100 MB) | ~50 μs (mmap + tracking) | ~20 μs (native) | ~2.5x |
| IOCP GetQueuedCompletionStatus | ~4 μs (eventfd + epoll) | ~0.8 μs (kernel queue) | ~5x |
| MapViewOfFile (section) | ~12 μs (shm + mmap) | ~3 μs (kernel section) | ~4x |
Assumptions: x86-64, Intel Core i7-12700K, Linux 6.1, WINE 8.x, single-threaded microbenchmarks. Real game workloads will show smaller end-to-end improvements due to GPU-bound and I/O-bound phases.
Projected game impact: 10-20% faster loading (synchronization-heavy), 5-10% better frame pacing (reduced NT emulation jitter). These projections require validation.
20c.9 API Surface & Stability
Key principle: WEA is an internal ISLE syscall API, not a Windows-compatible ABI. WINE/Proton are the only consumers.
Stability guarantee:
- WEA operations use the isle_syscall multiplexed entry point (Section 21a.2), with
operation codes in the 0x0800-0x08FF range (see §21a.3).
- Versioned API (WEA v1, v2, etc.) with capability negotiation via isle_op::WEA_VERSION_QUERY.
- WINE can check: "Does kernel support WEA v2?" before using new features.
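The "does the kernel support WEA v2?" check can be sketched as a query-once, branch-on-version pattern. The reply struct and feature bits below are hypothetical illustrations, not the actual WEA_VERSION_QUERY wire format:

```rust
/// Hypothetical reply shape for WEA_VERSION_QUERY (illustrative, not the
/// real WEA ABI): a major version plus optional-feature flags within it.
#[derive(Clone, Copy)]
pub struct WeaVersionReply {
    pub version: u32,      // 1 = WEA v1, 2 = WEA v2, ...
    pub feature_bits: u64, // optional features within a version
}

pub const WEA_FEAT_APC: u64 = 1 << 0;      // hypothetical feature bit
pub const WEA_FEAT_SECTIONS: u64 = 1 << 1; // hypothetical feature bit

/// What WINE would do at startup: gate each code path on the negotiated
/// version and feature set, falling back to userspace emulation otherwise.
pub fn can_use(reply: WeaVersionReply, min_version: u32, feature: u64) -> bool {
    reply.version >= min_version && reply.feature_bits & feature != 0
}

fn main() {
    // Pretend the kernel answered: "WEA v1, APC support only".
    let reply = WeaVersionReply { version: 1, feature_bits: WEA_FEAT_APC };
    assert!(can_use(reply, 1, WEA_FEAT_APC));       // usable
    assert!(!can_use(reply, 1, WEA_FEAT_SECTIONS)); // fall back to userspace
    assert!(!can_use(reply, 2, WEA_FEAT_APC));      // v2-only path disabled
}
```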
Non-goal: WEA does not aim to run Windows binaries directly. WINE/Proton are still required for:
- PE executable loading
- DLL loading, import resolution
- Win32 API emulation (user32.dll, kernel32.dll, etc.)
- DirectX → Vulkan translation (DXVK, VKD3D)
WEA only accelerates the kernel-level primitives that WINE currently emulates poorly.
20c.10 Implementation Roadmap
Phased Development Plan (no time estimates per ISLE policy):
Phase 1: NT object manager + basic synchronization
- Event, Mutex, Semaphore objects
- WaitForSingleObject, WaitForMultipleObjects
- Named object namespace
Phase 2: Memory management
- VirtualAlloc/VirtualFree with Windows semantics
- Section objects (shared memory)
- MapViewOfSection, UnmapViewOfSection
Phase 3: I/O completion ports
- IOCP creation, association, posting, dequeuing
- Integration with ISLE async I/O
Phase 4: Thread model extensions
- TEB support + fast NtCurrentTeb() via GS base
- APC queues
- Scheduler upcall registration (SYS_register_scheduler_upcall, §12a.7)
enabling correct fiber blocking behaviour for SwitchToFiber-based job systems
Phase 5: Security & tokens
- Minimal NT token emulation
- Privilege checks (mostly no-ops)
Phase 6: SEH support
- Kernel-assisted exception dispatch
- Unwind table parsing (x86-64)
Dependency: WINE/Proton must be modified to use WEA syscalls. Upstream WINE may not accept (they target all UNIX platforms). Proton fork more realistic (Valve controls it, Steam Deck focus).
20c.11 Benefits Summary
For users (projected, pending validation — see §20c.8):
- Games projected to run with 10-20% faster loading under Proton on ISLE vs Linux
- Better compatibility (some games that break on WINE/Linux may work on WEA/ISLE)
- Lower input latency (reduced NT emulation jitter)
For WINE/Proton developers:
- Less complex userspace emulation code
- Fewer bugs (kernel enforces correctness)
- Easier to support new Windows features (kernel does heavy lifting)
For ISLE:
- Gaming becomes a differentiation point vs Linux
- "Best platform for Windows gaming outside Windows" marketing
- Drives enthusiast adoption
Market impact:
- Steam Deck successor (if Valve interested)?
- Gaming-focused ISLE distribution (like SteamOS but ISLE-based)?
- Differentiation in the "Linux for gaming" space
20c.12 Open Questions
- Upstream WINE acceptance?
  - WINE targets macOS, FreeBSD, Solaris — not just Linux
  - ISLE-specific syscalls might not be upstreamable
  - Solution: Maintain an ISLE-specific WINE fork OR Proton-only support
- Anti-cheat compatibility?
  - EAC, BattlEye check kernel behavior
  - WEA changes kernel behavior (more Windows-like)
  - Could this improve or break anti-cheat support?
- Maintenance burden?
  - Windows NT is a moving target (Windows 11, Windows 12...)
  - ISLE must track changes to NT kernel APIs
  - Mitigation: Focus on stable APIs (NT 6.x kernel, used in Win7-Win11)
- Security implications?
  - NT object namespace shared across processes
  - Named objects can be hijacked (race conditions)
  - Resolved: Atomic create-or-open under write lock prevents TOCTOU (see §20c.1 NtObjectManager::create_named). Container isolation via NtSecurityDescriptor prevents cross-container object squatting.
- 32-bit Windows game support?
  - Many Windows games are still 32-bit (i686 PE executables)
  - ISLE does not support ia32 multilib (Section 21 "Deliberately Dropped")
  - Design decision: 32-bit Windows games run via WINE's WoW64-style thunking. WINE already implements 32-to-64 syscall translation for Linux. WEA syscalls are 64-bit only; WINE's 32-bit ntdll.dll thunks to 64-bit before calling WEA. This maintains ISLE's clean 64-bit-only syscall surface while supporting 32-bit games. Performance impact is minimal: the thunk is one function call in WINE's address space, not a kernel transition.
21. Deliberately Dropped Compatibility
These Linux features are intentionally not supported. Each omission protects a core design property of ISLE.
| Dropped feature | Why | Design property protected |
|---|---|---|
| Binary .ko kernel modules | Would require emulating Linux's unstable internal API | Stable KABI |
| ia32 multilib (32-bit on 64-bit) | Doubles syscall surface, complicates signal handling | Clean architecture |
| /dev/mem and /dev/kmem | Raw physical/kernel memory access | Capability-based security |
| Obsolete syscalls (~50+) | old_stat, socketcall, ipc multiplexer, etc. | Clean syscall surface |
| /sys/module/*/parameters | Tied to .ko module model | KABI-native configuration |
| Kernel cmdline module params | modname.param=val syntax tied to .ko model | KABI-native configuration |
| ioperm / iopl | Direct I/O port access from user space | Driver isolation |
| kexec (initially) | Complex interaction with driver model | Clean shutdown/recovery |
Obsolete syscalls not implemented (partial list): old_stat, old_lstat,
old_fstat, socketcall, ipc (multiplexer), old_select, old_readdir,
old_mmap, uselib, modify_ldt (except minimal for TLS), vm86, vm86old,
set_thread_area (x86 only; use arch_prctl instead).
Only syscalls that current glibc (2.17+) and musl (1.2+) actually emit are implemented.
21a. ISLE Native Syscall Interface
21a.1 Motivation
ISLE implements ~80% of Linux syscalls natively with identical POSIX semantics — read,
write, open, mmap, fork, socket, etc. are the kernel's own API. For these,
the syscall entry point performs only representation conversion (untyped C ABI → typed
Rust internals: int fd → CapHandle<FileDescriptor>, void *buf → UserPtr<T>),
not semantic translation.
However, ~20% of operations fall into two categories where Linux's interface is fundamentally inadequate:
- Thin adaptation (~15%): Linux has an interface but it's untyped, fragmented, or encodes the wrong abstraction. Examples: ioctl(fd, MAGIC, void*) for driver interaction, clone3() flag explosion for process creation, prctl() as a catch-all for unrelated operations, five separate observability interfaces (perf, ftrace, sysfs, tracepoints, BPF).
- No Linux equivalent (~5%): ISLE has capabilities that Linux does not expose at all. Examples: capability delegation with attenuation, isolation domain management, distributed shared memory, per-cgroup power budgets.
For both categories, ISLE defines native syscalls that expose the full richness of the kernel's typed, capability-based model. These syscalls are available alongside the Linux-compatible interface — unmodified Linux applications continue to use Linux syscalls and work correctly; ISLE-aware applications can opt into the native interface for stronger typing, finer-grained control, and access to ISLE-specific features.
21a.2 Design Principles
- Native syscalls supplement, never replace, Linux-compatible ones. Every operation achievable via a native syscall must also be achievable via the Linux-compatible interface (even if with less type safety or fewer features). Linux applications never need ISLE-native syscalls.
- Typed arguments. Native syscalls use fixed-layout Rust-compatible structs, not unsigned long catch-alls or void * blobs. Every argument is validated at the syscall entry point against the struct layout.
- Capability-first. Native syscalls accept CapHandle arguments directly. Permission checks are explicit in the syscall signature, not hidden inside the implementation.
- Versioned. Each native syscall struct includes a size: u32 field (like Linux's clone3 and openat2). The kernel handles smaller structs from older userspace by zero-filling new fields. This provides forward-compatible extensibility without syscall number proliferation.
- Namespaced. All native syscalls use a single multiplexed entry point (isle_syscall(u32 op, *const u8 args, u32 args_size) -> i64) to consume only one syscall number from the Linux range. The op code selects the operation.
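The size-versioned struct convention (the `size: u32` field, as in Linux's clone3/openat2) can be sketched at the byte level: the kernel copies only as many bytes as the caller's struct provides and zero-fills the remainder, so new fields silently take their default value for older userspace. This is an illustrative model, not the kernel's actual copy routine:

```rust
/// Model of the kernel-side copy for a size-versioned syscall struct.
/// `user_bytes` is what the caller passed (its length is the caller's
/// struct size); `kernel_size` is the size of the struct version the
/// kernel was compiled with.
fn copy_versioned(user_bytes: &[u8], kernel_size: usize) -> Vec<u8> {
    let mut buf = vec![0u8; kernel_size]; // zero-fill: new fields default to 0
    let n = user_bytes.len().min(kernel_size);
    buf[..n].copy_from_slice(&user_bytes[..n]);
    buf
}

fn main() {
    // A v1 caller passes an 8-byte struct; the v2 kernel expects 16 bytes.
    let v1_args = [1u8, 0, 0, 0, 42, 0, 0, 0];
    let kernel_view = copy_versioned(&v1_args, 16);
    assert_eq!(&kernel_view[..8], &v1_args);
    assert!(kernel_view[8..].iter().all(|&b| b == 0)); // new fields zeroed
}
```

(Linux's openat2 additionally rejects oversized structs with non-zero trailing bytes via -E2BIG; whether ISLE adopts the same rule for newer-userspace-on-older-kernel is not specified here.)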
21a.3 Syscall Families
/// ISLE native syscall operation codes.
/// Grouped by subsystem. Each family reserves a 256-entry range for
/// forward-compatible extension without renumbering.
pub mod isle_op {
// ── Capability operations (0x0100 - 0x01FF) ──────────────────────
/// Create a new capability with specified rights from an existing one.
/// Equivalent to: dup() + fcntl() but typed and with attenuation.
pub const CAP_DERIVE: u32 = 0x0100;
/// Restrict an existing capability's permissions (irreversible).
/// No Linux equivalent — fcntl cannot reduce permissions on an fd.
pub const CAP_RESTRICT: u32 = 0x0101;
/// Query the permission set of a capability handle.
pub const CAP_QUERY: u32 = 0x0102;
/// Revoke a specific capability by handle.
pub const CAP_REVOKE: u32 = 0x0103;
/// Delegate a capability to another process via IPC, with optional
/// attenuation (reduced rights). The recipient receives a new handle
/// with at most the permissions specified by the sender.
pub const CAP_DELEGATE: u32 = 0x0104;
// ── Typed driver interaction (0x0200 - 0x02FF) ───────────────────
/// Invoke a typed KABI operation on a driver.
/// Replaces: ioctl(fd, request, arg) with typed, versioned structs.
/// The driver's KABI version is checked at invocation time.
pub const DRV_INVOKE: u32 = 0x0200;
/// Query a driver's supported KABI interfaces and versions.
pub const DRV_QUERY: u32 = 0x0201;
/// Subscribe to driver health/status events (structured, typed).
/// Replaces: various sysfs polling and netlink listening patterns.
pub const DRV_SUBSCRIBE: u32 = 0x0202;
// ── Isolation domain management (0x0300 - 0x03FF) ────────────────
/// Query the isolation tier and domain of a capability handle.
pub const DOM_QUERY: u32 = 0x0300;
/// Request domain statistics (cycle counts, fault counts, memory).
pub const DOM_STATS: u32 = 0x0301;
// ── Distributed operations (0x0400 - 0x04FF) ─────────────────────
/// Allocate a distributed shared memory region.
/// No Linux equivalent.
pub const DSM_ALLOC: u32 = 0x0400;
/// Map a remote DSM region into the local address space.
pub const DSM_MAP: u32 = 0x0401;
/// Set coherence policy for a DSM region (strict, relaxed, release).
pub const DSM_SET_POLICY: u32 = 0x0402;
/// Query cluster membership and node health.
pub const CLUSTER_INFO: u32 = 0x0410;
// ── Accelerator operations (0x0500 - 0x05FF) ─────────────────────
/// Create an accelerator context (GPU, NPU, FPGA) with typed caps.
/// Replaces: DRM_IOCTL_* and VFIO ioctls with unified typed API.
pub const ACCEL_CTX_CREATE: u32 = 0x0500;
/// Submit work to an accelerator context.
pub const ACCEL_SUBMIT: u32 = 0x0501;
/// Query accelerator utilization and health.
pub const ACCEL_QUERY: u32 = 0x0502;
/// Wait for accelerator fence completion.
pub const ACCEL_FENCE_WAIT: u32 = 0x0503;
// ── Power management (0x0600 - 0x06FF) ───────────────────────────
/// Set per-cgroup power budget (watts).
/// No Linux equivalent — Linux uses sysfs strings.
pub const POWER_SET_BUDGET: u32 = 0x0600;
/// Query current power consumption for a cgroup or domain.
pub const POWER_QUERY: u32 = 0x0601;
// ── Observability (0x0700 - 0x07FF) ──────────────────────────────
/// Subscribe to structured kernel events (health, tracepoints, audit).
/// Replaces: fragmented perf_event_open / ftrace / sysfs / netlink.
pub const OBSERVE_SUBSCRIBE: u32 = 0x0700;
/// Query kernel object by path in the unified object namespace (islefs).
pub const OBSERVE_QUERY: u32 = 0x0701;
// ── Windows Emulation Acceleration (0x0800 - 0x08FF) ─────────────
// WEA operations for WINE/Proton acceleration (Section 20c).
// These provide kernel-native NT-compatible primitives.
/// Query WEA version and supported features.
pub const WEA_VERSION_QUERY: u32 = 0x0800;
/// Create an NT event object (manual-reset or auto-reset).
pub const WEA_EVENT_CREATE: u32 = 0x0801;
/// Open an existing named NT event object.
pub const WEA_EVENT_OPEN: u32 = 0x0802;
/// Set (signal) an NT event.
pub const WEA_EVENT_SET: u32 = 0x0803;
/// Reset (unsignal) an NT event.
pub const WEA_EVENT_RESET: u32 = 0x0804;
/// Pulse an NT event (signal and immediately reset).
pub const WEA_EVENT_PULSE: u32 = 0x0805;
/// Create an NT mutex object.
pub const WEA_MUTEX_CREATE: u32 = 0x0810;
/// Create an NT semaphore object.
pub const WEA_SEMAPHORE_CREATE: u32 = 0x0811;
/// Wait for a single NT object to become signaled.
pub const WEA_WAIT_SINGLE: u32 = 0x0820;
/// Wait for multiple NT objects (WaitAny or WaitAll semantics).
pub const WEA_WAIT_MULTIPLE: u32 = 0x0821;
/// Create an NT section (memory-mapped file or shared memory).
pub const WEA_SECTION_CREATE: u32 = 0x0830;
/// Map a view of an NT section into the process address space.
pub const WEA_SECTION_MAP: u32 = 0x0831;
/// Unmap a view of an NT section.
pub const WEA_SECTION_UNMAP: u32 = 0x0832;
/// Create an I/O completion port.
pub const WEA_IOCP_CREATE: u32 = 0x0840;
/// Associate a file with an IOCP.
pub const WEA_IOCP_ASSOCIATE: u32 = 0x0841;
/// Post a completion packet to an IOCP.
pub const WEA_IOCP_POST: u32 = 0x0842;
/// Dequeue a completion packet from an IOCP.
pub const WEA_IOCP_GET: u32 = 0x0843;
/// Memory operations with Windows semantics (reserve/commit/reset).
pub const WEA_VIRTUAL_ALLOC: u32 = 0x0850;
/// Change memory protection with old-protection output.
pub const WEA_VIRTUAL_PROTECT: u32 = 0x0851;
/// Lock pages in physical memory.
pub const WEA_VIRTUAL_LOCK: u32 = 0x0852;
/// Queue an APC to a thread.
pub const WEA_APC_QUEUE: u32 = 0x0860;
/// Alert a thread (deliver queued APCs).
pub const WEA_ALERT_THREAD: u32 = 0x0861;
/// Open a process token for security queries.
pub const WEA_TOKEN_OPEN: u32 = 0x0870;
/// Query token information (user, groups, privileges).
pub const WEA_TOKEN_QUERY: u32 = 0x0871;
/// Close an NT handle.
pub const WEA_HANDLE_CLOSE: u32 = 0x08F0;
/// Duplicate an NT handle.
pub const WEA_HANDLE_DUP: u32 = 0x08F1;
}
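Because each family reserves a 256-entry range, the first level of isle_syscall dispatch reduces to inspecting the high byte of the op code. The family table below mirrors the groupings in isle_op; the dispatch itself is an illustrative sketch, not the kernel's actual routing code:

```rust
/// First-level dispatch sketch: the high byte of the op code selects the
/// 256-entry family range reserved in isle_op.
pub fn op_family(op: u32) -> Option<&'static str> {
    match op >> 8 {
        0x01 => Some("capability"),    // CAP_*
        0x02 => Some("driver"),        // DRV_*
        0x03 => Some("domain"),        // DOM_*
        0x04 => Some("distributed"),   // DSM_* / CLUSTER_*
        0x05 => Some("accelerator"),   // ACCEL_*
        0x06 => Some("power"),         // POWER_*
        0x07 => Some("observability"), // OBSERVE_*
        0x08 => Some("wea"),           // WEA_* (Section 20c)
        _ => None, // unreserved family: the syscall would return -ENOSYS
    }
}

fn main() {
    assert_eq!(op_family(0x0104), Some("capability")); // CAP_DELEGATE
    assert_eq!(op_family(0x0843), Some("wea"));        // WEA_IOCP_GET
    assert_eq!(op_family(0x0900), None);               // unreserved range
}
```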
21a.4 Userspace Library
Native syscalls are accessed through libisle, a thin userspace library that provides:
- C API with proper types (isle_cap_derive(), isle_drv_invoke(), etc.)
- Rust bindings via the isle-sys crate (zero-cost wrappers over the raw syscall)
- Version negotiation: libisle checks the kernel version at init and uses the appropriate struct sizes for forward/backward compatibility
Applications link against libisle. The library detects at runtime whether it is running
on an ISLE kernel (via /proc/version or uname) and returns -ENOSYS on non-ISLE
kernels, allowing portable applications to fall back to Linux-compatible interfaces.
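The detection-and-fallback pattern can be sketched with the kernel-identity read factored out so the decision itself is testable. The substring check and the -ENOSYS value (Linux's errno 38) are assumptions about how libisle might recognize the kernel; the real library could instead probe by issuing isle_syscall and checking for -ENOSYS:

```rust
/// Linux's ENOSYS errno value, negated per syscall return convention.
pub const ENOSYS: i64 = -38;

/// Hypothetical recognition check: in practice this string would come from
/// uname(2) or /proc/version at library init.
pub fn native_available(uname_sysname: &str) -> bool {
    uname_sysname.to_ascii_lowercase().contains("isle")
}

/// Portable pattern: try the native path, fall back to Linux-compatible
/// syscalls when it is unavailable. The `0` stands in for a successful
/// isle_syscall(CAP_DERIVE, ...) invocation.
pub fn cap_derive_or_fallback(uname_sysname: &str) -> i64 {
    if native_available(uname_sysname) {
        0 // would issue isle_syscall(isle_op::CAP_DERIVE, args, size)
    } else {
        ENOSYS // caller falls back to dup()/fcntl()
    }
}

fn main() {
    assert_eq!(cap_derive_or_fallback("ISLE"), 0);
    assert_eq!(cap_derive_or_fallback("Linux"), ENOSYS);
}
```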
21a.5 Relationship to Linux Syscalls
┌──────────────────────────────────────┐
│ Userspace Application │
└───────────┬──────────┬───────────────┘
│ │
Linux API │ │ ISLE Native API
(glibc) │ │ (libisle)
│ │
┌───────────▼──────────▼───────────────┐
│ Syscall Entry Point │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ Linux │ │ isle_syscall() │ │
│ │ nr→ │ │ op + typed args → │ │
│ │ dispatch │ │ dispatch │ │
│ └────┬─────┘ └────┬─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Internal Typed Kernel API │ │
│ │ (CapHandle, UserPtr, etc.) │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────┘
Both paths converge to the same internal kernel API. A read() via Linux's syscall(0, fd, buf, count)
and a hypothetical isle_read() via isle_syscall(op, args, size) call the same internal
vfs_read(cap_handle, user_ptr, count) function. The native path skips the fd→CapHandle
lookup (the caller already holds a CapHandle) and avoids the void* → UserPtr validation
(the struct is pre-typed). For most operations the performance difference is negligible;
for high-frequency driver interaction (DRV_INVOKE replacing ioctl), the typed path
avoids the ioctl dispatch switch and provides measurably lower overhead.