Sections 47–48 of the ISLE Architecture. For the full table of contents, see README.md.
Part XII: Distributed Systems
Distributed kernel architecture and SmartNIC/DPU integration.
47. Distributed Kernel Architecture
This section extends ISLE from a single-machine kernel to a distributed-capable kernel with RDMA-native primitives, page-level distributed shared memory, cluster-aware scheduling, global memory pooling, and network-portable capabilities. These are kernel-internal capabilities — distributed applications remain in userspace, but the kernel transparently optimizes data placement and movement across RDMA fabric.
Design Constraints:
- Drop-in compatibility: A single-node ISLE system behaves identically to a non-distributed kernel. All distributed features are opt-in. Existing Linux binaries (MPI, NCCL, Spark, Redis, PostgreSQL) work without modification.
- Superset, not replacement: Standard TCP/IP sockets, POSIX shared memory, and SysV IPC work exactly as before. Distributed capabilities are additional. Applications that opt in (via new interfaces or transparent kernel policies) get better performance.
- RDMA-native from day one: The kernel's core primitives (IPC, page cache, memory management) are designed with RDMA transport in mind, not bolted on after the fact.
- Page-level coherence: Distributed memory coherence operates at page granularity (4KB minimum), not cache-line granularity (64B). This is the fundamental design decision that makes distributed shared memory practical over network latencies.
- Graceful degradation: Node failures are handled. No single point of failure. Split-brain is detected and resolved. Partial cluster operation is always possible.
47.1 Motivation
47.1.1 The Hardware Shift
The datacenter is becoming a single, disaggregated computer:
2015: Machines are islands. 10GbE, TCP/IP, millisecond latencies.
Networking = slow, unreliable. Kernel = local machine only.
2020: RDMA everywhere. 100GbE RoCEv2 / InfiniBand HDR (200Gb/s).
1-2 μs latency. Kernel-bypass networking is the norm for HPC.
2024: CXL 2.0 memory pooling. Disaggregated memory over PCIe fabric.
Memory can live outside the machine, accessible at ~200-400ns.
2025-2026: CXL 3.0 hardware-coherent shared memory. PCIe 6.0 (64 GT/s).
400GbE / InfiniBand NDR (400Gb/s). Sub-microsecond RDMA.
2027+: CXL switches, memory fabric topology, composable infrastructure.
The distinction between "local" and "remote" memory blurs.
The hardware is converging on a model where:
- Network latency (RDMA) ≈ remote NUMA latency ≈ CXL latency
- All three are ~1-5 μs, compared to NVMe SSD at ~10-15 μs
- Remote memory over RDMA is faster than local SSD
No existing operating system is designed for this reality.
47.1.2 Why Linux Cannot Adapt
Linux has two completely separate networking paradigms:
World A: Socket-based (kernel-managed)
- TCP/IP, UDP, Unix sockets
- Kernel manages connections, buffers, routing
- Page cache, VFS, block I/O all use this world
- Latency: ~5 μs per packet (kernel overhead)
World B: RDMA/verbs (kernel-bypass)
- InfiniBand verbs, RoCE
- Application manages everything via libibverbs
- Kernel provides setup (protection domains, memory registration) then gets out
- Page cache, VFS, block I/O know nothing about this world
- Latency: ~1-2 μs per operation
These worlds do not interact. There is no way for the Linux page cache
to fetch a page from a remote node via RDMA. There is no way for the
Linux scheduler to migrate a process to where its data lives across
an RDMA link. There is no way for Linux IPC to transparently extend
to a remote node.
Most distributed features in Linux (DRBD, Ceph, GlusterFS, GFS2, OCFS2)
are built on World A (sockets). NFS and SMB have added RDMA transports
(svcrdma/xprtrdma since 2.6.24, SMB Direct since v4.15, KSMBD since
v5.15), but these are bolt-on transport alternatives — the core protocols,
data structures, and failure handling remain socket-oriented. None use
RDMA verbs (CAS, FAA, one-sided Read/Write) for lock-free data structure
access or coherence protocols. ISLE's distinction is using RDMA atomics
and one-sided operations as the primary coordination primitive, not just
as a transport.
Previous attempts to add distributed capabilities to Linux (Kerrighed, OpenSSI, MOSIX, SSI clusters, GAM patches) all failed because:
- Linux's core subsystems (MM, scheduler, VFS, IPC) assume single-machine
- Patches touched thousands of lines across dozens of subsystems
- Every kernel update broke the patches
- Cache-line-level coherence over network was too expensive
- No clean abstraction boundary — distributed logic was smeared everywhere
47.1.3 ISLE's Structural Advantage
ISLE's existing architecture is uniquely suited for distributed extension:
| Existing Feature | Distributed Extension |
|---|---|
| NUMA-aware memory manager (per-node buddy allocators) | Remote node = distant NUMA node |
| PageLocationTracker (Section 43.1.5) | Already tracks CPU, GPU, compressed, swap — add RemoteNode |
| MPSC ring buffer IPC | Ring buffer maps naturally to RDMA queue pair |
| Capability tokens (generation-based revocation) | Cryptographic signing → network-portable |
| Device registry (topology tree) | Extend topology to include RDMA fabric |
| AccelBase KABI (Section 42) | GPU on remote node = remote accelerator |
| CBS bandwidth guarantees (Section 15) | Extend CBS to cluster-wide resource accounting |
| Object namespace (Section 41) | \Cluster\Node2\Devices\gpu0 |
The key insight: ISLE already models heterogeneous memory (CPU RAM, GPU VRAM, compressed pages, swap) as different tiers in a unified memory hierarchy. Remote memory over RDMA is just another tier. The memory manager already knows how to migrate pages between tiers on demand. Extending "tiers" to include remote nodes is a natural generalization, not a fundamental redesign.
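The "remote memory is just another tier" claim can be pictured as one extra variant on the page-location type. A hedged sketch: `PageLocation`, `NodeId`, and the cost figures below are illustrative stand-ins, not the actual Section 43.1.5 definitions; the numbers mirror the distance-matrix examples in Section 47.2.4.

```rust
/// Illustrative page-location tiers. The real PageLocationTracker
/// (Section 43.1.5) tracks CPU, GPU, compressed, and swap; RemoteNode
/// is the one new variant distributed operation adds.
type NodeId = u32;

#[derive(Clone, Copy, PartialEq, Debug)]
enum PageLocation {
    CpuDram { numa_node: u32 },
    GpuVram { device: u32 },
    Compressed,
    Swap,
    /// New: page lives in another node's DRAM, reachable via RDMA.
    RemoteNode { node: NodeId },
}

/// Rough per-4KB migration cost in nanoseconds (illustrative numbers
/// taken from the Section 47.2.4 examples).
fn migration_cost_ns(loc: PageLocation) -> u32 {
    match loc {
        PageLocation::CpuDram { .. } => 500,      // cross-socket memcpy
        PageLocation::GpuVram { .. } => 1_500,    // PCIe DMA
        PageLocation::Compressed => 2_000,        // decompress in place
        PageLocation::RemoteNode { .. } => 4_000, // RDMA RTT + DMA, same switch
        PageLocation::Swap => 12_000,             // local NVMe read
    }
}
```

Note that `RemoteNode` (~4 μs) sorts below `Swap` (~12 μs): the memory manager's existing cost-driven tier migration automatically prefers remote DRAM over local SSD, which is exactly the observation in Section 47.1.1.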
47.2 Cluster Topology Model
47.2.1 Extending the Device Registry
The device registry (Section 7) models hardware topology as a tree with parent-child and provider-client edges. For distributed operation, extend the tree to span multiple nodes:
\Cluster (root of distributed namespace)
+-- node0 (this machine)
| +-- pci0000:00
| | +-- 0000:41:00.0 (GPU)
| | +-- 0000:06:00.0 (NVMe)
| | +-- 0000:03:00.0 (RDMA NIC, mlx5)
| +-- cpu0 ... cpu31
| +-- numa-node0 (512GB DDR5)
| +-- numa-node1 (512GB DDR5)
|
+-- node1 (remote machine, discovered via RDMA fabric)
| +-- [remote device tree, cached]
| +-- numa-node0 (512GB DDR5, reachable via RDMA)
| +-- gpu0 (80GB VRAM, reachable via GPUDirect RDMA)
|
+-- node2 ...
|
+-- fabric (RDMA fabric topology)
+-- switch0 (InfiniBand switch)
| +-- port0 → node0:mlx5_0
| +-- port1 → node1:mlx5_0
+-- switch1
+-- port0 → node2:mlx5_0
+-- port1 → node3:mlx5_0
47.2.2 Cluster Node Descriptor
/// Describes a node in the cluster (including self).
#[repr(C)]
pub struct ClusterNode {
/// Unique node ID (assigned during cluster join, never reused).
pub node_id: NodeId,
/// Node state.
pub state: NodeState,
/// RDMA endpoint for reaching this node.
pub rdma_endpoint: RdmaEndpoint,
/// Total CPU memory available for remote access (bytes).
pub remote_accessible_memory: u64,
/// Number of NUMA nodes.
pub numa_nodes: u32,
/// Number of accelerators (GPUs, NPUs, etc.).
pub accelerator_count: u32,
/// Round-trip latency to this node (nanoseconds, measured).
/// u32 covers up to ~4.29 seconds, sufficient for datacenter and
/// campus-scale clusters. WAN links with higher RTT use the
/// TCP fallback transport (Section 47.14.6), not RDMA, and are
/// represented in the distance matrix (Section 47.2.4) instead.
pub measured_rtt_ns: u32,
/// Unidirectional bandwidth to this node (bytes/sec, measured).
/// Field name uses `bytes_per_sec` to avoid ambiguity with
/// "bps" (bits per second) common in networking contexts.
pub measured_bw_bytes_per_sec: u64,
/// Heartbeat: last received timestamp.
pub last_heartbeat_ns: u64,
/// Heartbeat: monotonic generation (detects restarts).
pub heartbeat_generation: u64,
/// Cluster protocol version (must match to join).
/// Nodes with mismatched protocol_version are rejected during cluster join.
pub protocol_version: u32,
pub _pad: [u8; 32],
}
pub type NodeId = u32;
#[repr(u32)]
pub enum NodeState {
/// Node is reachable and healthy.
Active = 0,
/// Node is joining (exchanging topology, syncing state).
Joining = 1,
/// Node missed heartbeats but not yet declared dead.
Suspect = 2,
/// Node is unreachable. Its resources are being reclaimed.
Dead = 3,
/// Node is gracefully leaving (draining work, migrating pages).
Leaving = 4,
}
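Failure detection drives these states from heartbeat age. A minimal sketch of the timeout-driven transitions, assuming the 1-second suspect timeout stated in the Mode A summary (§47.2.2) and a hypothetical 5-second dead timeout that this section does not fix:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState { Active, Joining, Suspect, Dead, Leaving }

/// Heartbeats arrive every 100 ms. Suspect after 1 s of silence (per the
/// Mode A summary); Dead after 5 s (assumed here for illustration).
const SUSPECT_AFTER_NS: u64 = 1_000_000_000;
const DEAD_AFTER_NS: u64 = 5_000_000_000;

/// Advance a node's state from the age of its last heartbeat.
/// Joining and Leaving are driven by the membership protocol, not by
/// timeouts, and Dead is terminal (a restarted node rejoins fresh).
fn next_state(current: NodeState, now_ns: u64, last_heartbeat_ns: u64) -> NodeState {
    let silence = now_ns.saturating_sub(last_heartbeat_ns);
    match current {
        NodeState::Active | NodeState::Suspect => {
            if silence >= DEAD_AFTER_NS {
                NodeState::Dead
            } else if silence >= SUSPECT_AFTER_NS {
                NodeState::Suspect
            } else {
                NodeState::Active // a fresh heartbeat clears Suspect
            }
        }
        other => other,
    }
}
```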
#[repr(C)]
pub struct RdmaEndpoint {
/// RDMA GID (Global Identifier) — InfiniBand/RoCE address.
pub gid: [u8; 16],
/// Queue pair number for control channel.
pub control_qpn: u32,
/// Protection domain key for this cluster.
pub pd_key: u32,
/// RDMA device index on the local machine.
pub local_rdma_device: u32,
pub _pad: [u8; 12],
}
Device-local kernels as cluster members — The ClusterNode structure describes
traditional CPU-based compute nodes, but modern hardware increasingly runs its own
firmware OS:
- SmartNICs/DPUs: NVIDIA BlueField-2/3 DPUs run full Ubuntu with 8-16 ARM cores, 16-32 GB DRAM, and can host containers and VMs. Intel IPU and AMD Pensando DPUs run similar firmware stacks.
- GPUs: NVIDIA GPUs run CUDA firmware that schedules work across thousands of cores, manages HBM memory, and coordinates P2P transfers. AMD GPUs run ROCm firmware.
- Storage controllers: High-end NVMe controllers and RAID cards run embedded RTOS or Linux to manage flash translation layers, wear leveling, and caching.
- CXL devices: CXL defines three device types, each with a different operating model in ISLE's multikernel cluster:
  - Type 1 (coherent compute, no device-managed memory): Device compute participates in the host CPU cache coherency domain via CXL.cache. Natural Mode B peer — ring buffers in shared memory are hardware-coherent without explicit flush. Examples: coherent FPGAs, smart NICs with CXL.
  - Type 2 (compute + device-managed memory): Both CXL.cache (device cache in host coherency domain) and CXL.mem (device DRAM accessible to host via load/store). The richest ISLE peer type — bidirectional zero-copy coherent access. Device runs ISLE on its embedded cores, Mode B ring buffers are coherent in both directions. Examples: future CXL-attached GPUs, AI accelerators with HBM.
  - Type 3 (memory expansion, minimal or no compute): Provides additional DRAM via CXL.mem; host sees it as a slower NUMA node. The tiny management processor (if present, typically ARM/RISC-V) acts as a memory-manager peer, not a compute peer: it manages tiering, compression, encryption, and error reporting for the pool, but does not run workloads. Examples: Samsung CMM-H, Micron CZ120, SK Hynix AiMM. See §47.12.5 for the full Type 3 operating model and crash recovery distinction.
Rather than treating these as passive devices controlled exclusively by the host kernel, ISLE's distributed design allows device-local kernels to participate as first-class cluster members. A BlueField-3 DPU running ISLE could:
- Join the cluster as a peer node with its own NodeId, exchange topology with other nodes, and participate in membership/heartbeat protocols.
- Expose resources in the device registry: its own CPUs, DRAM, and attached storage/network as remotely-accessible resources.
- Run workloads: containers or VMs can be scheduled on the DPU's cores, with distributed locking and DSM providing transparent access to host memory or other cluster nodes.
- Offload functions: RDMA transport, network filtering, encryption, compression, or storage can run on the DPU with kernel-level coordination via the distributed lock manager and DSM.
This multikernel model treats a single physical server as a cluster of heterogeneous kernels — one on the host CPU, one on each DPU, one on each GPU (if the GPU firmware exposes cluster primitives). The distributed protocols (membership, DSM, DLM, quorum) work identically whether communicating between physical servers or between the host and a DPU on the same PCIe bus.
Protocol requirements — For a device-local kernel to participate as a first-class cluster member, it must implement ISLE's inter-kernel messaging protocol. This is a wire protocol, not an API — the device kernel does not need to be ISLE itself, but it must speak the same language:
- Transport layer: RDMA (for network-attached nodes) or PCIe P2P MMIO+interrupts (for on-board devices like DPUs/GPUs). The device must expose:
  - A control channel for cluster management messages (join, heartbeat, topology sync)
  - A data channel for DSM page transfers and DLM lock requests
  - MMIO-mapped doorbell registers or MSI-X interrupts for signaling
- Cluster membership protocol (Section 47.3):
  - Implement the join handshake: authenticate, exchange topology, sync protocol version
  - Send periodic heartbeats (every 100ms) with monotonic generation counter
  - Respond to membership queries with node state (Active, Suspect, Dead, Leaving)
  - Participate in failure detection: mark other nodes as Suspect if heartbeats missed
- DSM page protocol (Section 47.5):
  - Accept page ownership transfer requests: PAGE_REQUEST(vpfn, read|write)
  - Respond with page data or forward the request if not owner
  - Implement cache coherence state machine (Owner, Shared, Invalid)
  - Participate in invalidation broadcasts for write requests
- DLM lock protocol (Section 47.6):
  - Accept lock acquisition requests: LOCK_ACQUIRE(lock_id, mode=shared|exclusive)
  - Maintain lock ownership table and grant/deny based on current holders
  - Support one-sided RDMA lock operations (atomic CAS on lock words)
  - Implement deadlock detection timeout (10 seconds default)
- Serialization format: All messages use fixed-size binary structs with explicit padding and versioning. Each message has a 32-byte header:

#[repr(C)]
pub struct ClusterMessageHeader {
    pub protocol_version: u32, // Currently 1
    pub message_type: u32,     // JOIN_REQUEST, HEARTBEAT, PAGE_REQUEST, etc.
    pub node_id: NodeId,       // Sender's node ID
    pub _pad0: u32,            // Explicit padding so `sequence` is 8-byte aligned
    pub sequence: u64,         // Message sequence number (for ordering)
    pub payload_length: u32,   // Bytes following this header
    pub checksum: u32,         // CRC32C of header (checksum field zeroed) + payload
}
// Total size: exactly 32 bytes under repr(C), with no implicit padding.

Payload structs are defined in Section 47.4 (message formats). All fields are little-endian. Nodes with mismatched protocol_version are rejected during join.
Implementation paths for device vendors:
- Path A: Full ISLE on device — Run ISLE kernel on the device's embedded CPU (e.g., BlueField DPU with 16 ARM cores runs ISLE natively). This gives full protocol support with zero extra work. The device becomes a cluster node indistinguishable from a regular server.
- Path B: Firmware shim — Device vendor implements a minimal protocol adapter in their existing firmware. The adapter translates ISLE cluster messages into the device's native operations. Example: NVIDIA GPU firmware receives PAGE_REQUEST messages and responds by copying HBM pages to system memory via GPUDirect RDMA. Does not require rewriting the entire firmware stack.
- Path C: Host-side proxy driver — A Tier 1 driver on the host acts as proxy, translating ISLE cluster messages into device-specific commands (PCIe transactions, proprietary protocols). The device appears as a cluster member but is actually controlled by the proxy. Lower performance than Path B (extra CPU involvement) but requires zero firmware changes.
Near-term hardware targets — ISLE already builds for aarch64-unknown-none and
riscv64gc-unknown-none-elf. Devices with ARM or RISC-V cores can run the ISLE kernel
with zero ISA porting work, making Path A immediately actionable:
| Device | Cores | ISA | Path | Notes |
|---|---|---|---|---|
| NVIDIA BlueField-2 DPU | 8× Cortex-A72 | AArch64 | A | Replace host OS with ISLE. PCIe P2P to host. Currently runs Ubuntu. |
| NVIDIA BlueField-3 DPU | 16× Neoverse N2 | AArch64 | A | Same. Higher bandwidth NIC. |
| Marvell OCTEON 10 DPU | ARM Neoverse N2 | AArch64 | A | Open SDK. Same category as BlueField. |
| Microchip PolarFire SoC FPGA | 4× U54 | riscv64gc | A | ISLE boot target. FPGA implements custom datapath. Open toolchain. |
| StarFive JH7110 (VisionFive 2) | 4× U74 | riscv64gc | A | Boots ISLE today. PCIe expansion for host interconnect. |
| SiFive Intelligence X280 | U74 + RVV | riscv64gc | A | RISC-V vector AI accelerator. ISLE-compatible ISA. |
| Netronome Agilio CX (NFP3800) | NFP microengines | proprietary | B | Open C/BPF SDK. Published ring interface specs. Implement ISLE ring protocol in NFP firmware. |
| AMD/Xilinx Alveo U50/U250 | FPGA + ARM | AArch64 / FPGA | A or B | Fully programmable. Define any protocol in RTL. ISLE on embedded ARM for Path A. |
| Samsung SmartSSD | Zynq UltraScale+ (ARM + FPGA) | AArch64 | A or B | ARM Cortex-A53 runs ISLE. FPGA handles NVMe datapath. NVMe CSI spec published. |
| Samsung CMM-H | ARM management core | AArch64 | A (mgmt) | CXL Type 3 memory expander. Management core runs ISLE as a memory-manager peer (Type 3 model, §47.12.5). 256 GB–1 TB LPDDR5 pool. |
| Micron CZ120 | ARM management core | AArch64 | A (mgmt) | CXL Type 3. Same model as CMM-H. CXL 2.0, 128 GB–512 GB. |
| SK Hynix AiMM | ARM management core | AArch64 | A (mgmt) | CXL Type 3 with in-memory compute (PIM). Management core as memory-manager peer. |
The key insight: RISC-V devices are uniquely positioned as zero-effort Path A targets.
ISLE already cross-compiles to riscv64gc-unknown-none-elf with OpenSBI boot. Any device
with a RISC-V core and OpenSBI can boot an unmodified ISLE kernel — no porting required.
ARM-based DPUs (BlueField, OCTEON) are equally zero-effort via the AArch64 build target.
Security boundary — If a device firmware participates as a cluster member, it must be trusted to the same degree as any cluster node. A malicious or compromised GPU firmware with cluster membership could:
- Request arbitrary memory pages via DSM (reading sensitive data)
- Corrupt shared memory by writing to DSM pages
- Initiate denial-of-service by flooding lock requests
Therefore, device cluster membership is disabled by default and enabled per-device:
echo 1 > /sys/bus/pci/devices/0000:41:00.0/isle_cluster_enabled
Only devices running signed, verified firmware (Section 6) should be granted cluster membership in secure environments.
Initially, only host kernels participate. Device participation is a Phase 5-6 capability (Sections 55-56) that requires firmware modifications by hardware vendors or open-source firmware projects. The protocol specification will be published as an RFC-style document to enable third-party implementations.
Real-world precedents:
- Barrelfish multikernel OS: Research OS where each CPU core runs its own kernel instance, coordinating via message passing. ISLE generalizes this to heterogeneous hardware.
- BlueField DPU offload: Current NVIDIA BlueField firmware can run OVS, storage targets, or custom applications, but coordination with the host is ad-hoc userspace protocols. ISLE provides kernel-level coordination.
- GPU Direct Storage: NVIDIA GDS allows GPUs to directly access NVMe storage, bypassing the CPU. This is a point solution. ISLE's model makes such bypasses general-purpose.
Lightweight mode for intra-machine devices — The full distributed protocol (DSM page transfers, DLM with RDMA CAS, quorum protocols) was designed for multi-node clusters over RDMA networks with 1-5 μs latency. For intra-machine devices (host ↔ DPU/GPU on the same PCIe bus), the latency is 10x lower (~200-500ns), but failure modes are different (no network partitions, but devices can hang or reset independently).
Architectural principle: message passing is the primitive. ISLE's IPC model (Section 8a) is built on message passing — explicit ownership transfer, capability-mediated channels, and defined send/receive semantics. This is the architectural primitive. It composes uniformly across every boundary ISLE targets: intra-kernel, kernel-user, cross-process, cross-VM, cross-network, and hardware peer. The "shared memory fast path" described in Mode B below is not a competing model — it is how message passing is implemented locally when hardware cache coherency is available. The abstraction is always message passing; the transport is chosen to match the hardware.
Two coordination modes are supported:
Mode A: Full Distributed Protocol (default for network-attached nodes)
- All messages via RDMA or PCIe P2P with ClusterMessageHeader wire format
- DSM: explicit page ownership transfer with RDMA Read/Write
- DLM: distributed lock tables, RDMA atomic CAS, deadlock detection
- Membership: heartbeat every 100ms, suspect timeout 1 second
- Best for: Multi-node clusters, devices that may have partial failures
Mode B: Hardware-Coherent Transport (optional for trusted local devices)
- The message-passing ownership guarantee is provided by the hardware cache coherency
protocol (PCIe ACS, CCIX, or CXL.cache) rather than by the software ownership
transfer protocol. The MESI state machine in hardware plays the same role that the
software ownership protocol plays over RDMA: only one writer at a time, cached reads
see the latest write. No software ownership messages are needed on top of hardware
coherency — that would be redundant.
- Both host and device map the same physical memory (via PCIe BAR mappings or pinned
system memory). Locks use local atomic ops (x86 LOCK CMPXCHG, ARM LDXR/STXR)
on cache-coherent shared memory instead of RDMA CAS.
- Membership via MMIO doorbell registers (no network heartbeat overhead).
- 5-10x lower latency: ~50-100ns for lock acquire vs ~500ns-1μs for RDMA CAS.
- Requires: device must be cache-coherent with host (PCIe ATS + ACS, CCIX, or CXL.cache).
Non-coherent devices (all current discrete GPUs, most current DPUs) must use Mode A.
When to use each mode:
| Scenario | Mode | Reason |
|---|---|---|
| Multi-node RDMA cluster | A | Must handle network failures, can't assume cache coherence |
| BlueField DPU running full ISLE | A | Separate memory space, needs explicit coordination |
| Future CXL 3.0-coherent GPU | B | CXL.cache coherency makes hardware the ownership protocol |
| Integrated GPU / APU (UMA) | B | CPU and GPU share coherent on-die/on-package memory |
| NVMe storage controller | B | Controller and host share command queues in coherent memory |
| Untrusted/unverified device | A | Mode B relies on hardware coherency guarantee — only for verified hardware |
Mode B is an optimization, not a separate protocol. It reuses the same data structures (lock tables, membership records) but the coherency guarantee is provided by hardware instead of software message passing. A device can fall back to Mode A if cache coherence fails or if the device resets.
Selection is per-device at join time:
# Force Mode A (full protocol) for untrusted DPU
echo "rdma://pci:0000:41:00.0?mode=distributed" > /sys/kernel/isle/cluster/join
# Enable Mode B (shared memory) for trusted GPU with cache-coherent access
echo "shmem://pci:0000:03:00.0" > /sys/kernel/isle/cluster/join
Implementation note: Mode B requires PCIe ATS (Address Translation Services) or CXL.cache to ensure device accesses to system memory are cache-coherent. Non-coherent devices (most current GPUs) must use Mode A.
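In Mode B the lock word is an ordinary atomic in cache-coherent shared memory, so acquire is a local CAS rather than an RDMA CAS. A sketch under that assumption; the lock-word encoding (0 = free, otherwise the holder's NodeId, so node IDs are assumed to start at 1) is illustrative, not the DLM's actual format:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

type NodeId = u32;
const LOCK_FREE: u32 = 0; // sketch assumes valid NodeIds start at 1

/// Illustrative Mode B lock word: lives in memory mapped by both host and
/// device. Hardware coherency (CXL.cache or coherent PCIe) makes the CAS
/// globally visible, playing the role RDMA CAS plays in Mode A.
fn try_lock(word: &AtomicU32, me: NodeId) -> bool {
    word.compare_exchange(LOCK_FREE, me, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn unlock(word: &AtomicU32, me: NodeId) {
    // Only the holder may release. A real DLM would guard this with
    // ownership checks and fall back to Mode A if the device resets.
    let prev = word.swap(LOCK_FREE, Ordering::Release);
    debug_assert_eq!(prev, me);
}
```

On x86 the `compare_exchange` compiles to LOCK CMPXCHG and on ARM to an LDXR/STXR pair, matching the ~50-100ns acquire latency cited above for uncontended locks.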
47.2.2.1 Host-Side Component: isle-peer-transport
A device participating as a multikernel peer does not require a traditional
device driver on the host. A traditional driver (e.g., mlx5_core, amdgpu,
nvme) must understand the device's register layout, command format, initialization
sequence, error recovery procedure, and internal resource model. All of that
complexity now lives inside the device's own kernel. The host never touches device
registers and has no knowledge of the device's internals.
What the host requires instead is a single generic module — isle-peer-transport
— that handles the PCIe connection to any ISLE peer device, regardless of what the
device actually does:
/// Host-side state for one ISLE peer kernel, maintained by isle-peer-transport.
/// Identical structure for every peer device — NIC, GPU, storage, custom ASIC.
/// The device's function is irrelevant to this layer.
pub struct PeerTransport {
/// PCIe BDF of the peer device (for unilateral controls, see §47.2.5).
pub pcie_bdf: PcieBdf,
/// Shared memory region for the inbound/outbound domain ring buffer pair.
/// Allocated by the host, mapped into PCIe BAR by device at join time.
pub ring_region: DmaBuffer,
/// MMIO doorbell register: host writes here to signal the device.
pub doorbell_mmio: MmioRegion,
/// MMIO watchdog register: device writes a counter here every ~10ms.
pub watchdog_mmio: MmioRegion,
/// Cluster membership state (shared with the membership protocol §47.3).
pub health: PeerKernelHealth,
/// Negotiated cluster protocol version.
pub protocol_version: u32,
}
isle-peer-transport does five things and nothing else:
- Enumerate — detect that the PCIe device at a given BDF exposes the ISLE peer capability register (a new PCI capability ID assigned to the ISLE protocol).
- Connect — allocate the shared ring buffer region, map the device's MMIO doorbell, run the cluster join handshake (§47.2.3).
- Monitor — poll the MMIO watchdog counter and participate in the heartbeat protocol to detect device failure (§47.2.5.4).
- Contain — execute IOMMU lockout, bus master disable, and FLR if the device fails (§47.2.5.5).
- Disconnect — handle voluntary CLUSTER_LEAVE (planned update, shutdown).
isle-peer-transport has zero device-specific logic. The same binary handles
a BlueField DPU, a RISC-V AI accelerator, a computational storage device, and any
future device that implements the ISLE peer protocol. The host kernel's dependency
on device-specific code goes from hundreds of thousands of lines (per device class)
to zero.
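The Monitor step reduces to checking that the device's MMIO watchdog counter keeps advancing. A hedged sketch: the 100 ms stall threshold and the `WatchdogState` bookkeeping are assumptions for illustration; the real check would read `watchdog_mmio` from the `PeerTransport` above.

```rust
/// Last observed watchdog reading and when it last changed.
struct WatchdogState {
    last_value: u64,
    last_change_ns: u64,
}

/// Illustrative stall threshold: the device bumps the counter every ~10 ms,
/// so 100 ms without movement suggests a hung or reset device (assumed value).
const STALL_AFTER_NS: u64 = 100_000_000;

/// Returns true if the peer looks alive. `current` would come from an
/// MMIO read of the watchdog register in the real transport.
fn watchdog_alive(st: &mut WatchdogState, current: u64, now_ns: u64) -> bool {
    if current != st.last_value {
        // Counter moved: record the new value and when we saw it change.
        st.last_value = current;
        st.last_change_ns = now_ns;
        return true;
    }
    now_ns.saturating_sub(st.last_change_ns) < STALL_AFTER_NS
}
```

A `false` return would trigger the Contain step (IOMMU lockout, bus master disable, FLR per §47.2.5.5) rather than any device-specific recovery, since the transport knows nothing about the device's internals.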
47.2.2.2 Live Firmware Update Without Host Reboot
Because the host has no device-specific driver and no knowledge of device internals, firmware updates on the device are entirely the device's own responsibility. The host is not involved in the update itself — it only observes the device leaving and rejoining the cluster.
Update procedure from the device's perspective:
1. Device decides to update (admin command, automatic policy, or
device-side health check triggers it).
2. Device sends CLUSTER_LEAVE (orderly departure, not a crash).
ClusterMessageHeader { message_type: MEMBER_LEAVING, node_id: self }
3. Host receives CLUSTER_LEAVE, executes graceful shutdown:
- Migrates any workloads running on device to surviving nodes.
- Drains in-flight IPC channels and DSM operations (waits for
completions, does not issue new requests to the device).
- Revokes cluster membership cleanly (no IOMMU lockout, no FLR —
this is voluntary, not a crash; §47.2.5 crash path not taken).
Host is fully operational throughout. Zero disruption to workloads
not using this device.
4. Device updates its own firmware:
- For Path A (full ISLE kernel): device does a kernel rolling update
(§52) or reboots its own cores with new kernel image.
- For Path B (firmware shim): device applies vendor firmware update
procedure — entirely internal, host has no visibility.
- For all-in-one firmware: same, internal to device.
The host cannot observe what happens inside. It only knows the
device is absent from the cluster.
5. Device reinitializes hardware on new firmware (internal).
6. Device sends CLUSTER_JOIN with new protocol version and capabilities.
Host authenticates (verifies firmware signature per §22/47.2.3),
negotiates protocol version, exchanges updated topology.
Device rejoins — may announce new capabilities (e.g., firmware
added support for a new DSM extension).
7. Workloads migrate back to device (or new workloads scheduled).
Update cadence is fully independent per device:
| Component | Update authority | Host reboot required? |
|---|---|---|
| Host kernel | Host admin | No (§52 live evolution) or yes |
| Device firmware (any path) | Device / device admin | Never |
| ISLE cluster protocol | Negotiated at join | No (backward-compatible range) |
| Device hardware capabilities | Announced at rejoin | No |
A device can update its firmware multiple times per day. The host never reboots.
The host's isle-peer-transport module never changes. The only host-visible event
is the device being absent for the duration of the update (seconds to minutes,
device-dependent).
For Path A devices running full ISLE, §52 (Live Kernel Evolution) makes it possible to update individual kernel subsystems without even leaving the cluster — no CLUSTER_LEAVE at all. This is a future optimization requiring §52 to be implemented on the device kernel, but it is architecturally reachable.
47.2.2.3 Attack Surface Reduction
The shift from device-specific drivers to a generic peer transport has a significant security consequence. Traditional device drivers run in Ring 0 with full kernel privileges. A single memory-safety bug anywhere in driver code equals full kernel compromise. Driver code is the dominant source of kernel CVEs:
Linux kernel CVE distribution (approximate, 2020-2024):
~50% — driver bugs (memory safety, race conditions, use-after-free)
~15% — networking stack
~10% — filesystem
~25% — other subsystems
Lines of driver code per device class (approximate):
mlx5 (Mellanox NIC): ~150,000 lines in Ring 0
amdgpu (AMD GPU): ~700,000 lines in Ring 0
i915 (Intel GPU): ~400,000 lines in Ring 0
nvme (NVMe storage): ~15,000 lines in Ring 0
In the ISLE multikernel model:
isle-peer-transport (all devices, combined): ~2,000 lines in Ring 0
Device-specific code on device: lives behind IOMMU boundary
cannot reach host kernel memory
even if completely compromised
A vulnerability in device firmware (whether that is a full ISLE kernel, a firmware shim, or an all-in-one firmware) cannot escalate to the host kernel. The IOMMU is the hard boundary (§47.2.5.2). The firmware can be replaced or compromised entirely; the host kernel's critical structures (text, stacks, capability tables, scheduler state) remain unreachable.
The host's trust relationship with a peer device is:
1. At join time: verify firmware signature (§22) — the device presents a signed identity. If the signature is invalid, the join is rejected.
2. During operation: treat all messages as untrusted input — validate message type, version, checksum, and semantic correctness before acting. Same discipline as any network protocol.
3. On failure: IOMMU lockout bounds the damage regardless of what the device firmware does (§47.2.5).
This is a fundamentally different trust model than Ring 0 driver code, which must be trusted completely because there is no boundary between it and the kernel.
47.2.3 Topology Discovery
Cluster formation is explicit (no auto-discovery magic):
1. Admin configures cluster membership:
echo "rdma://10.0.0.2/mlx5_0" > /sys/kernel/isle/cluster/join
2. Kernel establishes RDMA control channel to target node:
- Create RDMA queue pair (reliable connected)
- Perform authenticated key exchange:
- Each node has an Ed25519 signing key pair (for authentication) and an X25519
Diffie-Hellman key pair (for key exchange)
- Nodes exchange X25519 public keys and authenticate them with Ed25519 signatures
- Shared secret derived via X25519 DH, then HKDF-SHA256 to derive session keys
- Mutual authentication via pre-shared cluster secret or PKI
3. Exchange topology information:
- Each node sends its device registry summary (devices, NUMA, accelerators)
- Each node sends its memory availability (total, available for remote access)
- Each node sends its RDMA capabilities (bandwidth, latency, features)
4. Measure link quality:
- Ping-pong latency measurement (RTT)
- Bandwidth probe (bulk RDMA write)
- Results stored in ClusterNode.measured_rtt_ns / measured_bw_bytes_per_sec
5. Fabric topology construction:
- If InfiniBand: query subnet manager for switch topology
- If RoCEv2: infer topology from latency measurements + LLDP
- Build cluster-wide distance matrix (for scheduling and placement)
6. Cluster is operational.
Heartbeat monitoring begins (RDMA send every 100ms per node).
47.2.4 Distance Matrix
The memory manager and scheduler need a cost model for data movement:
/// Cluster-wide distance matrix.
/// distance[src][dst] = cost in nanoseconds to move one page (4KB).
///
/// Examples (measured, not configured):
/// Local NUMA same-node: ~300ns (cache-hot) to ~3-5μs (cold, typical for migration)
/// Local NUMA cross-socket: ~500-800ns (4KB memcpy, QPI/UPI hop)
/// GPU VRAM (PCIe): ~1000-2000ns (4KB DMA over PCIe)
/// Remote node (same switch): ~3000-5000ns (RDMA RTT + DMA)
/// Remote node (cross-switch): ~5000-8000ns
/// CXL-attached memory: ~200-400ns
/// NVMe SSD: ~10000-15000ns
/// Remote NVMe (NVMe-oF/RDMA): ~15000-25000ns
/// HDD: ~5000000ns (5ms)
///
/// Key observation: remote RDMA is faster than local SSD.
/// Design constant: maximum cluster size. 64 nodes fits in a u64 bitmask
/// for efficient set operations (reader sets, membership views, allowed_nodes).
/// For clusters larger than 64 nodes, extend to `[u64; N]` or a hierarchical
/// directory (see Section 47.14.6 open questions).
pub const MAX_CLUSTER_NODES: usize = 64;
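The 64-node bound exists so that node sets reduce to single-word bit operations. An illustrative sketch — the `NodeSet` type and its methods are assumed names for this example, not part of the spec:

```rust
/// Sketch: a cluster node set (reader set, membership view, allowed_nodes)
/// as a u64 bitmask, one bit per node. All names here are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct NodeSet(pub u64);

impl NodeSet {
    pub const EMPTY: NodeSet = NodeSet(0);

    /// Add a node (e.g. a new reader of a DSM page).
    pub fn insert(self, node: u32) -> NodeSet {
        debug_assert!((node as usize) < 64);
        NodeSet(self.0 | (1u64 << node))
    }

    /// Remove a node (e.g. on membership revocation).
    pub fn remove(self, node: u32) -> NodeSet {
        NodeSet(self.0 & !(1u64 << node))
    }

    pub fn contains(self, node: u32) -> bool {
        self.0 & (1u64 << node) != 0
    }

    /// Member count — a single popcount instruction.
    pub fn len(self) -> u32 {
        self.0.count_ones()
    }

    /// Intersection, e.g. "readers that are still cluster members".
    pub fn intersect(self, other: NodeSet) -> NodeSet {
        NodeSet(self.0 & other.0)
    }
}
```

Insertion, removal, membership test, and intersection are all one or two ALU instructions, which is why the directory and membership paths can afford them on every page fault.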
/// Heap-allocated via slab — at ~49 KB this struct exceeds the kernel stack
/// (8-16 KB) and MUST NOT be stack-allocated. It is allocated from the slab
/// allocator at cluster join time (one instance per cluster membership) and
/// persists for the node's cluster lifetime. The `Box<ClusterDistanceMatrix>`
/// is stored in the `ClusterState` struct.
pub struct ClusterDistanceMatrix {
/// Node count (including self). Must be <= MAX_CLUSTER_NODES.
node_count: u32,
/// Distances[src_node * node_count + dst_node] = page migration cost (ns).
/// Symmetric for RDMA (full-duplex), but stored both directions
/// because asymmetric links exist (satellite, WAN).
distances: [u32; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Bandwidth[src_node * node_count + dst_node] = bytes/sec.
bandwidth: [u64; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Local NVMe 4KB random read latency in nanoseconds, measured at cluster join
/// time by calibrate_local_storage(). Used to decide whether to fetch a page
/// from a remote node or read it from local SSD. Default: 12,000 ns (12 μs) if
/// no local NVMe is present. Updated on storage topology changes.
ssd_cost_ns: u64,
}
impl ClusterDistanceMatrix {
/// Should we fetch a page from a remote node vs. from local SSD?
/// Returns true if remote fetch is expected to be faster.
pub fn prefer_remote_over_ssd(&self, local: NodeId, remote: NodeId) -> bool {
let remote_cost = self.page_cost_ns(local, remote);
// ssd_cost_ns is calibrated at cluster join by calibrate_local_storage()
// (Section 47.11) and updated on storage topology changes; it falls back
// to 12,000 ns (12 μs) if no local NVMe is present.
remote_cost < self.ssd_cost_ns
}
}
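`prefer_remote_over_ssd` relies on `page_cost_ns`, which is not shown in this excerpt. A minimal sketch of the flat-index lookup over a simplified matrix — the field names mirror `ClusterDistanceMatrix`, but the method bodies are assumed for illustration:

```rust
/// Simplified stand-in for ClusterDistanceMatrix, illustrating the
/// distances[src * node_count + dst] row-major lookup. Bodies are an
/// assumed sketch, not the spec's implementation.
pub struct DistanceMatrixSketch {
    node_count: u32,
    distances: Vec<u32>, // row-major: src * node_count + dst
    ssd_cost_ns: u64,
}

impl DistanceMatrixSketch {
    pub fn new(node_count: u32, default_ns: u32, ssd_cost_ns: u64) -> Self {
        Self {
            node_count,
            distances: vec![default_ns; (node_count * node_count) as usize],
            ssd_cost_ns,
        }
    }

    /// Record a measured cost; stored per direction (asymmetric links exist).
    pub fn set_cost(&mut self, src: u32, dst: u32, ns: u32) {
        self.distances[(src * self.node_count + dst) as usize] = ns;
    }

    /// Cost in nanoseconds to move one 4KB page from src to dst.
    pub fn page_cost_ns(&self, src: u32, dst: u32) -> u64 {
        self.distances[(src * self.node_count + dst) as usize] as u64
    }

    /// Remote fetch wins when its measured cost is below the calibrated
    /// local-NVMe read cost.
    pub fn prefer_remote_over_ssd(&self, local: u32, remote: u32) -> bool {
        self.page_cost_ns(local, remote) < self.ssd_cost_ns
    }
}
```

With the default 12 μs NVMe calibration, a same-switch RDMA peer (~4 μs) is preferred over local SSD, while a WAN-distant peer (~20 μs) is not.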
47.2.5 Peer Kernel Isolation and Crash Recovery
This section defines the isolation model for device-local kernels (Section 47.2.2 Path A/B) — devices running ISLE or an ISLE-protocol firmware as a first-class cluster peer. This model is fundamentally different from both Tier 1 driver crash recovery (Section 9) and remote node failure (Section 47.11), and must not be confused with either.
47.2.5.1 The Isolation Model Shift
With traditional drivers (Tier 1/Tier 2), ISLE operates a supervisor hierarchy: the host kernel loads, supervises, and recovers drivers. A Tier 1 driver crash is handled by the kernel — it detects the fault, revokes the domain, issues FLR, and reloads the driver binary. The kernel is always in control.
With a multikernel peer, this hierarchy does not exist. The device runs its own autonomous kernel on its own cores in its own physical memory. The host kernel:
- Does not load the device kernel
- Cannot inspect or modify the device kernel's private memory
- Cannot "reload" the device kernel the way it reloads a Tier 1 driver
- Cannot force the device into a known-good state without a full hardware reset
The isolation unit shifts from software domain (MPK/DACR keys, enforced by the kernel) to physical boundary (PCIe bus, IOMMU, enforced by hardware).
47.2.5.2 Host Unilateral Controls
Despite being a peer, the device is physically attached to the host's PCIe fabric. The host retains six unilateral controls it can exercise regardless of the device kernel's state — even against a buggy, wedged, or hostile peer. They form an escalating ladder from surgical containment to full hardware reset:
1. IOMMU domain lockout — The host programmed the device's IOMMU domain at join time. It can modify or revoke that domain at any time. Revoking the IOMMU domain prevents all further device DMA into host memory, hardware-enforced, with no cooperation from the device kernel. This is the primary containment action.
2. PCIe bus master disable — A single MMIO write to the device's Command register clears the Bus Master Enable bit. All PCIe transactions from the device are silently dropped by the root complex. Effective immediately; no cooperation needed.
3. Function Level Reset (FLR) — The host can issue FLR via the PCIe capability register. This resets all device hardware state: the device kernel dies, all in-flight DMA is cancelled, device cores reset. The device must re-initialize and rejoin the cluster. FLR is the multikernel equivalent of "driver reload" but takes device-reboot time (~1-30 seconds depending on firmware initialization) rather than driver-reload time (~50-150ms for Tier 1).
CXL Type 1/2 note: On CXL-attached peers with CXL.cache (a coherent cache in the host CPU's coherency domain), the CXL specification requires the device to flush all dirty cache lines back to host memory before the FLR completes. The host must not access the shared memory region until the FLR completion status is confirmed — cache lines in flight during FLR may be in an indeterminate state. ISLE's crash recovery sequence (§47.2.5.5) issues IOMMU lockout and bus master disable (steps 1-2) before FLR (step 7) precisely to ensure no new DMA or cache traffic originates from the device during the flush window.
4. PCIe Secondary Bus Reset (SBR) — The upstream bridge or root port can assert the Secondary Bus Reset bit in its Bridge Control register. This propagates a hard reset signal to all devices on that bus segment — more comprehensive than FLR (resets all PCIe functions on the device, not just one) but less disruptive than a full power cycle. Used when FLR is not supported or does not produce a clean reset.
5. PCIe slot power — The host can power-cycle the PCIe slot via the Hot-Plug Slot Control register or via ACPI _PS3/_PS0 platform methods, cutting and then restoring power to the slot entirely. Reserved for devices that do not respond to FLR or SBR.
6. Out-of-band management (BMC/IPMI) — On server-class hardware, the Baseboard Management Controller (BMC) has independent hardware control over PCIe slot power and presence, completely bypassing the host OS. The BMC operates on a separate management plane with its own network interface and power rail. This means that even if the host kernel itself is hung or severely degraded, a management controller can power-cycle misbehaving device slots and restore the host's ability to re-enumerate them. ISLE treats the BMC as a platform capability, not an ISLE component — its availability depends on the hardware platform, and it operates outside ISLE's software control path.
The critical consequence of this escalating ladder is: a heartbeat timeout from a peer device is never a dead end. The host always has a hardware path to reset and re-enumerate the device, independent of device cooperation. The peer kernel model does not introduce any situation where a failed device permanently wedges the host.
These controls are one-sided and non-cooperative. The host can execute them without any message to the device, without the device kernel's acknowledgment, and even while the device kernel is executing arbitrary code. They represent the irreducible minimum of host authority over physically attached devices.
47.2.5.3 Isolation Comparison
Traditional Tier 1 driver:
Same Ring 0 address space as host kernel.
Isolation via MPK/DACR — software-enforced, escape possible via WRPKRU.
Host kernel supervises: detects crash, revokes domain, FLR, reload.
Recovery: ~50-150ms (driver reload).
Can corrupt: own domain memory (not host kernel critical structures).
Multikernel peer (Mode A — message passing):
Separate address space, separate physical DRAM, separate cores.
Isolation via IOMMU — hardware-enforced, physically unreachable.
Host kernel does NOT supervise: detects via heartbeat, acts unilaterally.
Recovery: escalating reset ladder (FLR → SBR → slot power → BMC/IPMI).
At least one reset path always available; heartbeat timeout is never a dead end.
Can corrupt: RDMA pool (IOMMU-bounded, excludes kernel critical structures).
Can leave dirty: distributed state (held locks, DSM page ownership).
Multikernel peer (Mode B — hardware-coherent):
Same as Mode A for isolation (IOMMU still bounds device DMA).
Additional risk: in-flight coherent writes to shared pool may be torn.
Recovery: same as Mode A + pool scan and lock force-release.
Remote cluster node (network-attached):
No physical connection. No unilateral PCIe controls.
Isolation: network boundary, no shared memory at all.
Recovery: membership revocation, DSM page invalidation.
Can corrupt: nothing on host (messages-only interface).
The key asymmetry: a peer kernel crash is more isolated than a Tier 1 driver crash (no shared Ring 0 address space, IOMMU is harder than MPK), but less controlled (no supervisor relationship, no fast reload, possible dirty distributed state).
47.2.5.4 Crash Detection
Two detection paths run in parallel for attached peer kernels:
Primary: Cluster heartbeat (both Mode A and Mode B). The standard membership protocol (Section 47.11): heartbeat every 100ms. Missed for 1 second → Suspect. Missed for 3 seconds → Dead. This covers all failure modes including silent crashes and wedged firmware.
Secondary: MMIO watchdog (required for Mode B; recommended for Mode A). The device firmware writes a monotonically increasing counter to a dedicated MMIO register every 10ms. The host polls this register on any DLM lock acquisition timeout or DSM page request timeout. Stale counter → immediate Suspect escalation. Detection latency: ~20ms instead of up to 1 second via heartbeat alone.
Mode B devices that do not implement the MMIO watchdog must not be granted hardware-coherent transport access. The faster detection is mandatory for Mode B because stale lock state in the coherent pool propagates to all cluster members holding locks in that region.
/// Per-peer crash detection state, maintained by the host kernel.
pub struct PeerKernelHealth {
/// NodeId of the peer kernel.
pub node_id: NodeId,
/// Last observed MMIO watchdog counter value.
pub last_watchdog_count: AtomicU64,
/// Timestamp of last watchdog read (nanoseconds since boot).
pub last_watchdog_ts_ns: AtomicU64,
/// Number of consecutive missed heartbeats.
pub missed_heartbeats: AtomicU32,
/// Current health state.
pub state: AtomicU32, // PeerHealthState enum
/// PCIe BDF for unilateral control operations.
pub pcie_bdf: PcieBdf,
/// Coordination mode (determines recovery procedure).
pub mode: PeerCoordinationMode, // ModeA | ModeB
}
#[repr(u32)]
pub enum PeerHealthState {
Active = 0,
Suspect = 1, // Missed heartbeats or stale watchdog — alert, escalate
Dead = 2, // Confirmed failed — execute recovery sequence
Faulted = 3, // FLR issued, waiting for device to rejoin
}
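The watchdog staleness check can be sketched as follows: a counter that has not advanced even though enough time has elapsed for at least two 10ms firmware ticks escalates the peer to Suspect. Names beyond the struct fields above are assumptions, and std atomics stand in for the kernel's own primitives:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Watchdog interval the device firmware is expected to honor (10ms).
const WATCHDOG_PERIOD_NS: u64 = 10_000_000;

/// Minimal health record mirroring the watchdog-related fields of
/// PeerKernelHealth. The check function is an illustrative sketch.
pub struct WatchdogState {
    pub last_count: AtomicU64,
    pub last_read_ns: AtomicU64,
    pub state: AtomicU32, // 0 = Active, 1 = Suspect (see PeerHealthState)
}

/// Called on a DLM lock or DSM page request timeout: poll the MMIO counter
/// and escalate if stale. `mmio_count` is the value just read from the
/// device register; `now_ns` is nanoseconds since boot.
pub fn check_watchdog(w: &WatchdogState, mmio_count: u64, now_ns: u64) -> bool {
    let prev_count = w.last_count.swap(mmio_count, Ordering::AcqRel);
    let prev_ts = w.last_read_ns.swap(now_ns, Ordering::AcqRel);
    // Stale: counter did not advance although at least two watchdog
    // periods elapsed since the last sample.
    let stale = mmio_count == prev_count && now_ns - prev_ts >= 2 * WATCHDOG_PERIOD_NS;
    if stale {
        w.state.store(1, Ordering::Release); // Suspect
    }
    stale
}
```

Because the check piggybacks on operation timeouts rather than a dedicated poll loop, a healthy peer costs nothing extra; a wedged peer is flagged within ~20ms of the first stalled lock or page request.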
47.2.5.5 Recovery Sequence
On transition to Dead, the host kernel executes the following sequence.
Steps 1-3 are unilateral and non-cooperative. Steps 4-6 involve cleanup of
distributed state. Steps 7-8 are optional and admin-controlled.
Peer kernel crash recovery sequence:
1. IOMMU lockout (immediate, unilateral)
host.iommu.revoke_domain(peer.pcie_bdf)
— All device DMA into host memory blocked in hardware.
— All outstanding RDMA operations targeting this device's QPs are cancelled.
— Estimated time: <1ms (IOMMU domain table update + TLB shootdown).
2. PCIe bus master disable (immediate, unilateral)
host.pcie.clear_bus_master(peer.pcie_bdf)
— Device can no longer initiate any PCIe transactions.
— Defense-in-depth: belts-and-suspenders with IOMMU lockout.
— Estimated time: <1ms.
3. [Mode B only] Pool scan and lock force-release
host.rdma_pool.scan_and_release_locks(peer.node_id)
— Scan all DLM lock entries in the coherent pool owned by the dead peer.
— Force-release each held lock; set tombstone flag (LOCK_OWNER_DEAD).
— Waiters receive LOCK_OWNER_DEAD, not timeout.
— Scan all DSM directory entries owned by peer; mark as LOST.
— Estimated time: O(active_locks) — typically <10ms.
4. Cluster membership revocation
cluster.revoke_membership(peer.node_id)
— Broadcasts MEMBER_DEAD(peer.node_id) to all other cluster nodes.
— All nodes destroy their QP connections to the peer.
— Rkey for the peer's RDMA pool is invalidated.
— Estimated time: ~1 RTT to notify all members (~1-3ms on LAN).
5. DSM page invalidation
dsm.invalidate_all_pages_owned_by(peer.node_id)
— All DSM pages in the Owner or Shared state for the dead peer are
invalidated. Processes mapping these pages receive SIGBUS on next
access (same semantics as NFS server failure).
— Home-node responsibilities for pages hosted on the peer are
migrated to surviving nodes via the migration protocol (Section 47.5.3).
— Estimated time: O(pages_owned_by_peer) — typically <100ms.
6. Capability revocation
cap_table.revoke_all_for_node(peer.node_id)
— All capabilities granted to or from the peer are invalidated.
— Ongoing ring buffer channels to the peer are torn down.
— Estimated time: O(capabilities) — typically <10ms.
7. [Optional] Reset and device reboot (admin-controlled or auto-policy)
Escalating ladder — attempt each step in order, stop when device rejoins:
a. FLR: host.pcie.issue_flr(peer.pcie_bdf)
b. SBR: host.pcie.issue_sbr(peer.pcie_bdf) // if FLR fails or unsupported
c. Slot power cycle: host.pcie.power_cycle(peer.pcie_bdf) // last software resort
d. BMC/IPMI slot power: platform.bmc.power_cycle_slot(peer.slot_id) // OOB, if available
— Resets all device hardware state; device firmware re-initializes.
— If device re-joins cluster: resume normal operation.
— If device fails to re-join within timeout after all steps: mark as Faulted,
notify admin, do not attempt further resets automatically.
— At least one reset path (FLR or SBR) is always available for PCIe devices.
8. [Optional] Workload migration
scheduler.migrate_workloads_from(peer.node_id, policy)
— Workloads that were running on the peer's compute resources
(containers, VMs, scheduled tasks) are either:
a. Migrated to surviving nodes if they were checkpointable.
b. Terminated with SIGKILL if not checkpointable.
c. Suspended pending device recovery if short outage expected.
— Policy configured per-cgroup: migrate | terminate | suspend.
Total time to containment (steps 1-2): < 2ms. Total time to clean state (steps 1-6): < 200ms typical. Total time to full recovery (steps 1-8 with FLR): 1-30 seconds (device-dependent).
Compare to Tier 1 driver reload: 50-150ms total. The peer kernel recovery is slower because FLR + firmware re-initialization cannot be parallelized, and because distributed state cleanup (steps 4-6) is inherently network-speed rather than local-memory-speed. Steps 1-3 (containment) are comparable to or faster than Tier 1 domain revocation.
CXL Type 3 crash — distinct recovery model: When a CXL Type 3 memory-expander peer loses its management processor (crash, firmware fault, or reset), the recovery is fundamentally different from a compute peer crash:
- The DRAM cells do not disappear. The physical memory remains accessible to the host via CXL.mem load/store — it is DRAM on a PCIe/CXL bus, not in the device's compute domain. The host can still read and write those pages.
- The management layer is gone. Tiering decisions, compression metadata, encryption keys (if any), and bad-page tracking maintained by the management processor are no longer available. Compressed pages are unreadable until decompressed; encrypted pages are inaccessible if keys were held in device DRAM.
- Recovery action: The host marks the CXL pool as ManagementFaulted (not Dead). Pages without compression or encryption remain accessible normally. Pages with compression or encryption are migrated or treated as lost. The management processor is reset (FLR → re-initialize firmware); once recovered, it re-scans the pool metadata and rejoins as a memory-manager peer.
- No workload migration needed: There are no workloads running on a Type 3 management processor. Workloads using the CXL pool memory continue running unaffected (if their pages were uncompressed/unencrypted). Only pool management operations (tiering, new allocation) are suspended during recovery.
The PeerKernelHealth state machine (§47.2.5.4) gains an additional state for Type 3 peers: ManagementFaulted — memory accessible, management unavailable.
47.2.5.6 What Survives Peer Kernel Crash Intact
The host kernel's own state is fully protected:
| Component | Protected by | Survives peer crash? |
|---|---|---|
| Host kernel text, rodata | IOMMU (device cannot reach) | Always |
| Host kernel stacks | IOMMU (not in RDMA pool) | Always |
| Capability tables | IOMMU (not in RDMA pool) | Always |
| Scheduler state | IOMMU (not in RDMA pool) | Always |
| Host application memory | IOMMU (not in RDMA pool unless explicitly exported) | Always |
| RDMA pool — kernel structures | IOMMU bounds; pool scan on Mode B crash | Recovered in step 3 |
| RDMA pool — application DSM pages | Invalidated (step 5); app gets SIGBUS | Lost (must re-fetch or re-compute) |
| Other cluster nodes' state | Steps 4-6 clean up distributed references | Recovered |
| Host kernel stability | Nothing to crash | Never affected |
The host kernel never crashes due to a peer kernel crash. The IOMMU is the hardware guarantee; the recovery sequence is the software cleanup.
47.2.5.7 Relationship to Other Failure Handling Sections
- Section 9 (Driver Crash Recovery): Covers Tier 1/Tier 2 driver failures where the host kernel is the supervisor. The peer kernel model is a peer relationship, not a supervisor relationship. Do not apply Section 9 procedures to peer kernels.
- Section 47.11 (Cluster Membership and Failure Detection): Covers remote nodes connected over RDMA networks. Peer kernel recovery shares the membership protocol (steps 4-6) but adds the unilateral PCIe controls (steps 1-3) that are not available for network-attached nodes.
- Section 48.6 (DPU Failure Handling): Covers the §48 offload tier model where the DPU acts as a managed device via the OffloadProxy KABI. If the same DPU runs a full ISLE peer kernel (Section 47.2.2 Path A), this section applies instead. The two models are mutually exclusive for a given device.
47.3 RDMA-Native Transport Layer
47.3.1 Design: Kernel RDMA Transport (isle-rdma)
Unlike Linux where RDMA is a separate subsystem used only by applications, ISLE integrates RDMA into the kernel's transport layer so that any kernel subsystem can use RDMA for data movement.
Linux architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: memcpy, TCP sockets, block I/O │
│ (no RDMA awareness) │
├───────────────────────────────────────────────────┤
│ RDMA subsystem (ib_core, ib_uverbs) │
│ └── Only used by: userspace apps via libibverbs │
│ (kernel subsystems don't use this) │
└───────────────────────────────────────────────────┘
ISLE architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: isle-rdma transport (when remote) │
│ Local path: same as before (memcpy, DMA) │
│ Remote path: RDMA read/write via isle-rdma │
├───────────────────────────────────────────────────┤
│ isle-rdma: unified RDMA transport for kernel use │
│ ├── Page migration (MM → remote NUMA node) │
│ ├── Page cache fill (VFS → remote page cache) │
│ ├── IPC ring (IPC → remote process) │
│ ├── Control messages (scheduler, capabilities) │
│ └── Userspace RDMA (libibverbs compat, unchanged) │
├───────────────────────────────────────────────────┤
│ RDMA hardware driver (mlx5, efa, bnxt_re, etc.) │
│ └── Implements RdmaDeviceVTable (Section 46) │
└───────────────────────────────────────────────────┘
47.3.2 Transport Abstraction
// isle-core/src/rdma/transport.rs (kernel-internal)
/// Unified transport for kernel-to-kernel communication.
/// Abstracts over RDMA hardware and provides the primitives that
/// other kernel subsystems (MM, IPC, page cache) use.
pub struct KernelTransport {
/// RDMA device used for kernel communication.
rdma_device: DeviceNodeId,
/// Protection domain for all kernel RDMA operations.
kernel_pd: RdmaPdHandle,
/// Pre-registered memory regions for fast page transfer.
/// Covers designated RDMA-eligible regions (default: 25% of RAM).
/// Avoids per-transfer memory registration overhead.
kernel_mr: RdmaMrHandle,
/// Per-remote-node connections (reliable connected QPs).
/// O(1) lookup indexed by NodeId. With MAX_CLUSTER_NODES = 64, the array
/// occupies 64 * size_of::<Option<NodeConnection>>() bytes.
connections: [Option<NodeConnection>; MAX_CLUSTER_NODES],
/// Per-CPU completion queues, one per logical CPU.
/// A single shared CQ becomes a bottleneck under load because all CPUs
/// contend to drain it. Per-CPU CQs allow each CPU's RDMA poll thread
/// to drain its own queue independently, eliminating cross-CPU contention.
/// Each QP in NodeConnection is created against the CQ of the CPU that
/// owns that connection's polling thread (assigned at connection setup time).
/// Indexed by cpu_id; length = num_online_cpus() at transport init.
/// Uses a slab-allocated slice rather than a fixed array: the kernel has no
/// compile-time MAX_CPUS (Section 14.1) — CPU count is runtime-discovered.
/// Allocated once at transport init, never resized on the hot path.
cqs: SlabSlice<RdmaCqHandle>,
/// Statistics.
stats: TransportStats,
}
pub struct NodeConnection {
/// Reliable connected queue pair for control messages.
control_qp: RdmaQpHandle,
/// Reliable connected QP for one-sided RDMA (Read/Write/Atomic).
/// RDMA Read/Write requires RC or UC transport; UD only supports Send/Receive.
data_qp: RdmaQpHandle,
/// Remote node's memory region key (for RDMA read/write).
remote_rkey: u32,
/// Remote node's base address for page transfers.
remote_base_addr: u64,
/// Flow control: outstanding RDMA operations.
inflight: AtomicU32,
/// Maximum concurrent RDMA operations to this node.
max_inflight: u32,
}
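The `inflight`/`max_inflight` pair implements a simple credit scheme: a work request is posted only if a slot can be reserved first, and the slot is returned when the completion queue reports the operation done. A sketch of that reserve/release pattern — the type and method names are assumptions, and std atomics stand in for the kernel's own:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Flow-control slice of NodeConnection: bounds outstanding RDMA
/// operations to a single remote node.
pub struct FlowControl {
    inflight: AtomicU32,
    max_inflight: u32,
}

impl FlowControl {
    pub fn new(max_inflight: u32) -> Self {
        Self { inflight: AtomicU32::new(0), max_inflight }
    }

    /// Try to reserve a slot before posting a work request. Returns false
    /// when the connection is at its cap (caller backs off or queues).
    pub fn try_acquire(&self) -> bool {
        let mut cur = self.inflight.load(Ordering::Relaxed);
        loop {
            if cur >= self.max_inflight {
                return false;
            }
            match self.inflight.compare_exchange_weak(
                cur, cur + 1, Ordering::Acquire, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Release a slot when the completion queue reports the op finished.
    pub fn release(&self) {
        self.inflight.fetch_sub(1, Ordering::Release);
    }
}
```

The compare-exchange loop (rather than an unconditional `fetch_add`) guarantees the counter never exceeds `max_inflight`, so the send queue can never be overrun even under concurrent posters.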
/// Operations available to kernel subsystems.
impl KernelTransport {
/// RDMA Read: fetch a page from remote node to local memory.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page fault on remote page), page cache (remote fill).
pub fn fetch_page(
&self,
remote_node: NodeId,
remote_phys_addr: u64,
local_page: PhysAddr,
size: u32, // Usually 4096 (4KB) or 2097152 (2MB huge page)
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Write: push a page from local memory to remote node.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page eviction to remote node), DSM (writeback).
pub fn push_page(
&self,
remote_node: NodeId,
local_page: PhysAddr,
remote_phys_addr: u64,
size: u32,
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Atomic Compare-and-Swap (64-bit).
/// Used by: Distributed Lock Manager (Section 31a) uncontested acquire,
/// DSM page ownership transfers.
pub fn atomic_cas(
&self,
remote_node: NodeId,
remote_addr: u64,
expected: u64,
desired: u64,
) -> Result<u64, TransportError>;
/// Send a control message to remote node (two-sided, requires remote CPU).
/// Used by: cluster management, capability distribution, scheduler.
pub fn send_control(
&self,
remote_node: NodeId,
msg: &ControlMessage,
) -> Result<(), TransportError>;
/// Batch page transfer: move N pages in a single RDMA operation.
/// Used by: bulk migration, prefetch, process migration.
pub fn fetch_pages_batch(
&self,
remote_node: NodeId,
pages: &[(u64, PhysAddr)], // (remote_phys, local_phys) pairs
) -> Result<RdmaBatchCompletion, TransportError>;
}
47.3.3 Pre-Registered Kernel Memory
The biggest performance problem with Linux RDMA is memory registration. Before any RDMA operation, memory must be pinned and registered with the NIC hardware — translated to physical addresses and programmed into the NIC's address translation table. This costs ~1-10 μs per registration.
ISLE avoids per-transfer registration overhead by pre-registering designated RDMA-eligible memory regions at cluster join time:
Boot sequence:
1. RDMA NIC driver initializes (standard KABI init).
2. Cluster join is requested.
3. isle-rdma registers RDMA-eligible memory regions:
- A configurable pool (default: 25% of RAM) for DSM pages and RDMA buffers
- Pool size set via /sys/kernel/isle/cluster/rdma_pool_percent (1-75%)
- Single memory region per pool, one-time cost (~100 μs)
- NIC can DMA to/from registered pages without per-transfer registration
- Non-registered memory cannot be accessed by remote nodes
4. Remote nodes exchange rkeys (remote access keys).
5. Any kernel subsystem can now RDMA read/write registered pages with zero
setup cost. Pages outside the RDMA pool are not remotely accessible.
This is safe because:
- Only designated RDMA-eligible regions are registered (not all physical memory)
- The rkey is only shared with authenticated cluster members
- RDMA access is gated by the connection (reliable connected QP)
- The kernel controls which pages are exported (via page ownership tracking)
- IOMMU still validates all DMA (RDMA NIC goes through IOMMU like any device)
- Kernel text, kernel stacks, and security-sensitive structures are never in
the RDMA pool
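The pool sizing rule above (a percentage of RAM, clamped to 1-75%, default 25%) reduces to simple arithmetic. A sketch — the function name and the page-alignment choice are illustrative assumptions:

```rust
/// Sketch: compute the RDMA-eligible pool size from total RAM and the
/// /sys/kernel/isle/cluster/rdma_pool_percent setting (clamped to 1-75).
/// The result is rounded down to a 4KB page boundary, since registration
/// operates on whole pages.
pub fn rdma_pool_bytes(total_ram_bytes: u64, percent: u64) -> u64 {
    const PAGE_SIZE: u64 = 4096;
    let pct = percent.clamp(1, 75);
    let raw = total_ram_bytes / 100 * pct;
    raw / PAGE_SIZE * PAGE_SIZE
}
```

Dividing by 100 before multiplying avoids u64 overflow for any realistic RAM size; the sub-1% rounding loss is irrelevant at pool granularity.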
47.3.4 Performance Characteristics
| Operation | Latency | Bandwidth | CPU Involvement |
|---|---|---|---|
| RDMA Read (4KB page) | ~2-3 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Write (4KB page) | ~1-2 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Atomic CAS | ~2-3 μs | N/A | Zero on remote side |
| Control message (send/recv) | ~2-3 μs | N/A | Interrupt on remote side |
| Batch page transfer (64 pages) | ~5-10 μs | Near line rate | Zero on remote side |
| Memory registration (avoided) | 0 (pre-registered) | N/A | N/A |
Compare with alternatives:

| Alternative | 4KB Fetch Latency | Notes |
|------------|-------------------|-------|
| Local DRAM (same NUMA) | ~300-500 ns | Baseline (4KB page copy) |
| Local DRAM (cross-NUMA) | ~500-800 ns | QPI/UPI hop (4KB page copy) |
| RDMA (same switch) | ~3 μs | ~6-10x local DRAM, but... |
| CXL 2.0 pooled memory | ~200-400 ns | Hardware-coherent |
| NVMe SSD | ~10-15 μs | Current swap target |
| NFS/CIFS (TCP) | ~50-200 μs | Kernel networking overhead |
| HDD | ~5,000 μs | Rotational latency |
Key insight: Raw RDMA Read latency (~3 μs) is 3-5x faster than NVMe (~10-15 μs). However, the complete DSM page fault path (directory lookup + ownership negotiation + RDMA transfer) totals ~10-18 μs (see Section 47.5.4), which is comparable to NVMe latency rather than 3-5x faster. The advantage of remote DRAM over NVMe is not single-fault latency but bandwidth scalability: RDMA provides ~25 GB/s per port (200 Gb/s InfiniBand) vs. ~7 GB/s for a single NVMe SSD, and multiple RDMA ports can be aggregated. Remote memory is a better "swap" target than local disk for bandwidth-bound workloads.
47.3.5 Security Considerations
The pre-registered RDMA pool approach (Section 47.3.3) creates a security trade-off: any node with the rkey can read/write any address within the RDMA-eligible region on the remote node via one-sided RDMA. The attack surface is limited to the RDMA pool (default 25% of RAM), not all physical memory — kernel text, stacks, and security structures are excluded.
Explicit trust model assumption: All nodes in an ISLE cluster are mutually trusted kernel instances, authenticated during cluster join (X25519 key exchange authenticated with Ed25519 signatures, Section 47.2.3). A compromised node can read/write the RDMA pool of any other node. This is the same trust model used by all production InfiniBand and RoCE deployments today. The cluster join authentication is the trust boundary — once joined, nodes are peers. If a node is suspected of compromise, its cluster membership is revoked (Section 47.11), destroying all QP connections and invalidating its RDMA mappings immediately.
Mitigation 1: RDMA Memory Windows (MW Type 2) — For security-sensitive or multi-tenant deployments, use RDMA Memory Windows instead of pool-wide registration. Each page export creates a short-lived memory window with a unique rkey, scoped to the specific page or region being transferred. The window is revoked when page ownership changes. This adds ~0.5-3μs overhead per window create/destroy (HCA-dependent; ConnectX-5/6 MW Type 2 bind/invalidate involves firmware interaction and PCIe round-trips) but eliminates pool-wide exposure. A compromised node can only access pages for which it currently holds a valid memory window, not the entire 25% pool.
Mitigation 2: Rkey rotation — In trusted mode, the pool-wide rkey is rotated periodically (default: every 60 seconds). All nodes exchange new rkeys via the authenticated control channel. Old rkeys are invalidated after a grace period (2x rotation interval = 120s). This limits the window of exposure if an rkey is leaked outside the cluster: a revoked node retains stale access for at most 180 seconds (rotation interval + grace period). For deployments where this window is unacceptable (multi-tenant, zero-trust), use Mitigation 1 (RDMA Memory Windows) instead, which provides immediate per-page revocation at the cost of higher per-operation overhead.
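The exposure bound quoted above follows directly from the rotation parameters. A sketch of the arithmetic (the function name is an assumption):

```rust
/// Sketch: worst-case stale-rkey exposure window under rotation.
/// An rkey leaked immediately after issuance remains current for up to one
/// full rotation interval, and the old key is still accepted for the grace
/// period after that. Defaults: 60s rotation, 2x grace → 60 + 120 = 180s.
pub fn max_stale_access_secs(rotation_interval_secs: u64, grace_multiple: u64) -> u64 {
    let grace_period_secs = rotation_interval_secs * grace_multiple;
    rotation_interval_secs + grace_period_secs
}
```

Shortening the rotation interval tightens the bound linearly, at the cost of more frequent rkey exchanges over the control channel.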
Mitigation 3: Trusted cluster mode (default) — For single-tenant, physically secured clusters (the common datacenter case), the pool-wide registration with rkey rotation is the default. This matches current InfiniBand practice — all production RDMA deployments assume a trusted fabric.
Mitigation 4: Hardware memory encryption — Note: Total Memory Encryption (Intel TME-MK, AMD SME) operates at the memory controller boundary and does NOT encrypt data visible to DMA devices. RDMA NICs access memory through the IOMMU in the CPU's coherency domain and see plaintext. TME-MK/SME protects against physical DRAM extraction attacks (cold boot) but does not provide defense-in-depth against software-based RDMA attacks. For RDMA data-in-transit protection, use the AES-GCM encryption described above.
Acknowledged limitation: All four mitigations have trade-offs. MW Type 2 adds latency to every page transfer. Rkey rotation adds periodic coordination overhead. Trusted cluster mode requires a physically secured fabric. Hardware encryption prevents cross-node data snooping but doesn't prevent a compromised node from writing garbage to remote memory. A fully zero-trust RDMA solution would require per-operation cryptographic MACs. For RoCE (RDMA over Converged Ethernet), MACsec (IEEE 802.1AE) provides link-layer encryption and integrity at the Ethernet layer, protecting against physical tapping and injection. For native InfiniBand, MACsec is not applicable (IB does not use Ethernet framing); InfiniBand relies on partition-based isolation (P_Key), queue-pair key authentication (Q_Key), and subnet manager access control for security. Neither MACsec nor InfiniBand's native mechanisms provide per-operation cryptographic authentication for one-sided RDMA operations at line rate. ISLE's pragmatic stance: match existing InfiniBand/RoCE security practice (trusted fabric with partition isolation), offer MW-based restriction for multi-tenant or security-sensitive deployments, and upgrade to hardware-enforced per-operation authentication when NIC vendors deliver it (CXL 3.0's integrity model is a likely candidate).
One-sided RDMA authentication: There is a fundamental tension between one-sided RDMA (which requires no remote CPU involvement) and per-operation authentication (which requires CPU processing). The resolution: the trust boundary is the cluster join, authenticated via X25519 key exchange with Ed25519 signatures (Section 47.2.3). Within the cluster, nodes are mutually trusted for RDMA operations. If a node is compromised, its cluster membership is revoked — all QP connections to that node are destroyed and its RDMA mappings are invalidated.
This matches the security model of all existing RDMA deployments: InfiniBand assumes a trusted fabric, and RoCEv2 relies on network isolation (VLANs, VXLANs) for multi-tenant separation.
Default: Trusted mode (pool-wide registration). This matches existing InfiniBand/RoCE practice — all production RDMA deployments assume a trusted fabric. Set via:
/sys/kernel/isle/cluster/security_mode
# "trusted" (default) or "secure"
Why trusted mode is the default: The security boundary in an ISLE cluster is the cluster join authentication (Section 47.2.3: X25519 key exchange authenticated with Ed25519 signatures). Once a node is authenticated into the cluster, it is a trusted peer. RDMA pool-wide access does not expand the trust boundary — a compromised node already has full access to cluster resources through authenticated channels.
Datacenter environments provide additional isolation layers:
- Physical access control (racks, cages, badge readers)
- Network isolation (VLANs, VXLANs, dedicated RDMA fabrics)
- Node attestation during cluster join
For these environments, the ~5% memory-window overhead is unnecessary. Secure mode is appropriate for:
- Multi-tenant cloud environments where different customers share the RDMA fabric
- Environments without physical network isolation
- Defense-in-depth deployments where the threat model includes node compromise despite authentication
Switching to secure mode:
echo "secure" > /sys/kernel/isle/cluster/security_mode
# Note: This triggers a brief (~100ms) disruption as QPs are reconfigured
47.4 Distributed IPC
47.4.1 Extending Ring Buffers to RDMA
ISLE's IPC is built on MPSC ring buffers (Section 8, Zero-Copy I/O Path, which defines the MPSC ring buffer protocol), using SQE/CQE structures compatible with io_uring. The same ring buffer protocol works over RDMA.
Local IPC (current design, unchanged):
┌──────────┐ shared memory ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
└──────────┘ (mapped pages) └──────────┘
Same machine, same address space region.
Zero-copy. Latency: ~200ns.
Distributed IPC (new):
┌──────────┐ RDMA ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
│ (Node 0) │ (RDMA write) │ (Node 1) │
└──────────┘ └──────────┘
Different machines, RDMA-connected.
Zero-copy (RDMA, no kernel networking stack). Latency: ~2-3 μs.
The ring buffer protocol (SQE format, CQE format, head/tail pointers,
memory ordering) is identical. Only the transport changes.
47.4.2 Transparent Transport Selection
// isle-core/src/ipc/ring.rs (extend existing)
pub enum RingTransport {
/// Both endpoints on the same machine.
/// Ring is in shared memory (mapped into both processes).
SharedMemory {
ring_pages: PageRange,
},
/// Endpoints on different machines.
/// Submissions: producer RDMA-writes SQEs into consumer's ring memory.
/// Completions: consumer RDMA-writes CQEs back.
/// Doorbell: RDMA send (small control message) notifies consumer.
Rdma {
remote_node: NodeId,
remote_ring_addr: u64,
remote_ring_rkey: u32,
local_ring_pages: PageRange,
doorbell_qp: RdmaQpHandle,
},
/// Endpoints connected via CXL fabric (load/store accessible).
/// Ring is in CXL-attached shared memory. Same as SharedMemory
/// but pages are on a CXL memory pool node.
CxlSharedMemory {
ring_pages: PageRange,
cxl_node: NumaNodeId,
},
}
Transport selection is automatic:
Process A calls connect_ipc(target_process_id):
1. Kernel looks up target process location:
- Same machine → SharedMemory transport
- Remote machine with RDMA → Rdma transport
- CXL-connected pool → CxlSharedMemory transport
2. Kernel sets up ring buffer with appropriate transport
3. Process A gets back an IPC handle (opaque)
4. Process A uses the same SQE submission interface regardless of transport
5. The SQE/CQE format is identical in all cases
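The step-1 decision can be sketched as a pure function. This is an illustrative sketch, not the ISLE kernel API — `ProcessLocation`, `TransportKind`, and `select_transport` are hypothetical names mirroring the `RingTransport` variants above.

```rust
// Illustrative sketch of the step-1 decision; ProcessLocation, TransportKind,
// and select_transport are hypothetical names mirroring RingTransport above.

#[derive(Debug, PartialEq)]
enum ProcessLocation {
    LocalMachine,
    RemoteRdma { node: u32 },
    CxlPool { numa_node: u32 },
}

#[derive(Debug, PartialEq)]
enum TransportKind {
    SharedMemory,
    Rdma { remote_node: u32 },
    CxlSharedMemory { cxl_node: u32 },
}

/// Same machine -> shared memory; remote with RDMA -> RDMA ring;
/// CXL-connected pool -> CXL shared memory. The caller never sees this:
/// the SQE/CQE interface is identical for every variant.
fn select_transport(target: &ProcessLocation) -> TransportKind {
    match target {
        ProcessLocation::LocalMachine => TransportKind::SharedMemory,
        ProcessLocation::RemoteRdma { node } => TransportKind::Rdma { remote_node: *node },
        ProcessLocation::CxlPool { numa_node } => {
            TransportKind::CxlSharedMemory { cxl_node: *numa_node }
        }
    }
}

fn main() {
    assert_eq!(
        select_transport(&ProcessLocation::RemoteRdma { node: 3 }),
        TransportKind::Rdma { remote_node: 3 }
    );
    println!("transport selection ok");
}
```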
47.4.3 Ring Buffer RDMA Protocol
RDMA Ring Buffer Header Extension
The RDMA transport extends the base DomainRingBuffer header (Section 8a.2) with
additional fields for cross-node synchronization. These fields are placed in a
separate RdmaRingHeader that precedes the standard header:
/// RDMA-specific ring buffer header extension.
/// Placed immediately before the DomainRingBuffer in RDMA transport mode.
/// Consumer-owned fields are on a separate cache line from producer-owned fields.
#[repr(C, align(64))]
pub struct RdmaRingHeader {
// === Producer-owned cache line (written by remote producer via RDMA) ===
/// Sequence number of the last published SQE.
/// The producer increments this AFTER writing SQE data to the ring.
/// The doorbell message carries this sequence number.
/// Memory ordering: producer uses Release, consumer uses Acquire.
pub producer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_producer: [u8; 56],
// === Consumer-owned cache line (written locally) ===
/// Sequence number up to which the consumer has processed.
/// Used for flow control and to detect dropped entries.
pub consumer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_consumer: [u8; 56],
}
Protocol: Sequence-Based Doorbell Synchronization
The core challenge is that RDMA Write and RDMA Send operations may arrive at the consumer out of order — the doorbell (RDMA Send) could arrive before the SQE data (RDMA Write) is visible in memory. To solve this, we use a sequence counter that the consumer polls to determine data readiness:
Producer (Node A) submits work:
1. Write SQE to local staging buffer
2. RDMA Write: push SQE to consumer's ring memory on Node B
(one-sided, no CPU involvement on Node B)
3. RDMA Write: update producer_seq in consumer's RdmaRingHeader.
The new value is (previous_seq + 1).
On RC (Reliable Connection) QPs, RDMA Writes are guaranteed to be delivered
and executed in posting order at the responder. Therefore, the SQE data
(step 2) is always visible before producer_seq (step 3), without
requiring any additional fencing.
Note: IBV_SEND_FENCE is NOT needed here — it only orders operations after
prior RDMA Reads and Atomics, not after prior Writes.
(Each producer has a dedicated QP per consumer, so per-QP ordering suffices.)
4. If consumer was idle: RDMA Send doorbell with payload = new producer_seq value.
The doorbell serves only as a notification to wake the consumer; the consumer
does NOT trust the doorbell's sequence value directly. Instead, the consumer
reads producer_seq from local memory (where it was written by step 3) with
Acquire ordering to establish the happens-before relationship.
Consumer (Node B) processes work:
1. Receive doorbell notification (RDMA Send) OR poll timeout
2. Read producer_seq with Acquire ordering from local RdmaRingHeader
3. Compare producer_seq to consumer_seq to determine how many entries are ready:
ready_count = producer_seq - consumer_seq
4. For each ready entry:
a. Read SQE from local ring memory (RDMA write from step 2 already placed it there)
b. Process request
c. Write CQE to local completion ring
5. Update consumer_seq with Release ordering after processing
6. If completions generated: RDMA Write CQEs back to producer's ring on Node A
Latency breakdown:
RDMA Write (SQE, 64 bytes): ~1 μs
RDMA Write (producer_seq): ~0.5 μs (RC QP guarantees Write-after-Write ordering;
no IBV_SEND_FENCE needed — see note above)
RDMA Send (doorbell): ~0.5 μs (may arrive before or after step 3;
consumer uses seq polling to handle either order)
Consumer processing: application-dependent
RDMA Write (CQE, 32 bytes): ~1 μs
Total overhead: ~3 μs round-trip (plus processing time; +0.5 μs vs. non-seq approach
for the producer_seq write, but eliminates the race condition)
Why sequence-based synchronization is required:
RDMA Send and RDMA Write are different verb types. While RDMA Writes are ordered within a single QP, the relationship between RDMA Send and RDMA Write is not guaranteed by the InfiniBand specification. A naive protocol that sends the doorbell after writing SQE data could observe:
Timeline (broken protocol):
Producer: Write SQE --(network)--> Consumer receives SQE in memory
Producer: Send doorbell --(network)--> Consumer receives doorbell interrupt
If the doorbell packet arrives first (different routing, less data to transfer),
the consumer's interrupt handler reads the ring before the SQE RDMA Write has
arrived — it sees stale or uninitialized data.
The sequence counter solves this by making the data-visibility indication itself travel via RDMA Write (which is ordered with respect to the SQE data writes). The doorbell is merely an optimization to avoid spinning; correctness depends only on the producer_seq field, which the consumer reads with Acquire ordering.
Memory ordering semantics:
| Operation | Ordering | Rationale |
|---|---|---|
| Producer writes SQE data | Relaxed | No ordering requirement until seq is published |
| Producer writes producer_seq | Release | Ensures SQE data is visible before seq advances |
| Consumer reads producer_seq | Acquire | Ensures seq read happens before SQE data read |
| Consumer writes consumer_seq | Release | Ensures SQE processing completes before ack |
| Producer reads consumer_seq | Acquire | Ensures ack read happens before reusing slots |
On x86-64, Release/Acquire compile to plain MOV instructions (TSO provides the required ordering). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers (STLR/LDAR, fence-qualified atomics, lwsync/isync).
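The ordering rules above can be exercised on a single machine with standard atomics, using two threads as stand-ins for the two nodes (on real hardware the Release store is an RDMA Write that the RC QP orders after the data write). A minimal sketch; `Ring`, `producer`, and `consumer` are illustrative names, and the 4-slot ring elides wrap-around flow control.

```rust
// Single-machine analogue of the sequence-based doorbell protocol.
// Two threads stand in for the two nodes; the Release store on producer_seq
// plays the role of the ordered RDMA Write in step 3.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct Ring {
    slots: [AtomicU64; 4],   // stand-in for the SQE slots
    producer_seq: AtomicU64, // last published SQE (producer-owned)
    consumer_seq: AtomicU64, // last processed SQE (consumer-owned)
}

/// Producer: write SQE data (Relaxed), then publish the sequence (Release).
fn producer(ring: &Ring, sqe: u64) {
    let seq = ring.producer_seq.load(Ordering::Relaxed);
    ring.slots[(seq % 4) as usize].store(sqe, Ordering::Relaxed); // steps 1-2
    ring.producer_seq.store(seq + 1, Ordering::Release);          // step 3
}

/// Consumer: Acquire-read producer_seq; the data is then guaranteed visible.
fn consumer(ring: &Ring) -> Option<u64> {
    let prod = ring.producer_seq.load(Ordering::Acquire); // step 2
    let cons = ring.consumer_seq.load(Ordering::Relaxed);
    if prod == cons {
        return None; // ready_count == 0
    }
    let sqe = ring.slots[(cons % 4) as usize].load(Ordering::Relaxed); // step 4a
    ring.consumer_seq.store(cons + 1, Ordering::Release);              // step 5
    Some(sqe)
}

fn main() {
    let ring = Arc::new(Ring {
        slots: [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)],
        producer_seq: AtomicU64::new(0),
        consumer_seq: AtomicU64::new(0),
    });
    let p = Arc::clone(&ring);
    thread::spawn(move || producer(&p, 42)).join().unwrap();
    assert_eq!(consumer(&ring), Some(42));
    assert_eq!(consumer(&ring), None);
    println!("sequence protocol ok");
}
```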
47.4.4 Batching and Coalescing
For high-throughput scenarios (e.g., database replication, event streaming):
/// Batch submission: write multiple SQEs in a single RDMA operation.
/// Amortizes RDMA overhead across N entries.
pub struct RdmaBatchSubmit {
/// Number of SQEs to submit in this batch.
pub count: u32,
/// Maximum time to wait for batch to fill before flushing (μs).
pub max_coalesce_us: u32,
/// Minimum batch size before flushing.
pub min_batch_size: u32,
}
// With batching:
// Single SQE: ~3 μs overhead per entry
// Batch of 64 SQEs: ~5 μs total = ~78 ns per entry (38x improvement)
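The arithmetic behind the comment above, as a small helper (the 3 μs and 5 μs figures are the section's rough estimates, not measurements):

```rust
// Fixed RDMA overhead amortized across a batch: the per-entry cost is the
// batch's total cost divided by its size.

/// Per-entry overhead in nanoseconds when a batch costs `total_batch_ns`.
fn per_entry_ns(total_batch_ns: u64, batch_size: u64) -> u64 {
    total_batch_ns / batch_size
}

fn main() {
    let single = per_entry_ns(3_000, 1);   // lone SQE: ~3 us of RDMA overhead
    let batched = per_entry_ns(5_000, 64); // 64-SQE batch: ~5 us total
    assert_eq!(batched, 78);               // ~78 ns per entry
    assert_eq!(single / batched, 38);      // the "38x improvement" above
    println!("per-entry overhead: {} ns", batched);
}
```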
47.5 Distributed Shared Memory
47.5.1 Design Overview
Distributed Shared Memory (DSM) allows processes on different nodes to share a virtual address space. Pages migrate between nodes on demand, using RDMA for transport and page faults for coherence.
Why previous DSM projects failed and why this one can work:
| Past Problem | Our Solution |
|---|---|
| Cache-line coherence (64B) over network | Page-level coherence (4KB minimum) |
| Bolted onto Linux MM (invasive patches) | Designed into ISLE MM from start |
| No hardware fault mechanism for remote | RDMA + CPU page fault = standard demand paging |
| Software TLB shootdown over network | Targeted TLB shootdown via RDMA notification (only invalidating nodes listed in the sharer set) |
| Single writer protocol (slow) | Multiple-reader / single-writer with RDMA invalidation |
| No topology awareness | Cluster distance matrix drives all placement decisions |
| Catastrophic on node failure | Capability-based revocation + replicated directory |
47.5.2 Page Ownership Model
Every shared page has exactly one owner node and zero or more reader nodes:
// isle-core/src/dsm/ownership.rs
/// Ownership state of a distributed shared page.
#[repr(u8)]
pub enum DsmPageState {
/// Page is exclusively owned by this node. No remote copies exist.
/// This node can read and write freely.
Exclusive = 0,
/// Page is owned by this node, but read-only copies exist on other nodes.
/// To write: must first invalidate all reader copies.
SharedOwner = 1,
/// This node has a read-only copy. Owner is elsewhere.
/// To write: must request ownership transfer from current owner.
SharedReader = 2,
/// Page is not present on this node. Owner is elsewhere.
/// To read or write: fault → request page from owner via RDMA.
NotPresent = 3,
/// Page is being transferred (migration in progress).
Migrating = 4,
/// Ownership transfer is in progress: the directory entry has been updated to
/// reflect the new owner, but invalidations to current readers have not yet
/// completed. Nodes that read the directory entry during this window see
/// `Invalidating` and must spin-wait (via the seqlock protocol) until the
/// state transitions to `Exclusive`, indicating all invalidations are acked.
Invalidating = 5,
}
/// Directory entry for a distributed shared page.
/// Stored on the home node (determined by hash of virtual address).
#[repr(C)]
pub struct DsmDirectoryEntry {
/// Sequence counter for local CPU consistency on the home node.
/// DsmDirectoryEntry is larger than 8 bytes, so concurrent reads/writes on the
/// home node's CPU require a seqlock.
/// Protocol (home node CPU only — NOT for RDMA):
/// Writer (equivalent to Linux `write_seqlock()` / `write_sequnlock()`):
/// 1. Spin-wait until `sequence` is even (unlocked).
/// 2. CAS(sequence, even_value, even_value + 1, Acquire) — acquires exclusive
/// access AND starts the seqlock write section (makes sequence odd). If
/// CAS fails (another writer won the race or a concurrent increment), retry
/// from step 1. The CAS provides mutual exclusion: only one writer can
/// transition a given even value to odd.
/// 3. Modify fields.
/// 4. store(sequence, even_value + 2, Release) — releases exclusive access AND
/// completes the seqlock write section (makes sequence even, advanced by 2
/// total from the original value). The Release ordering ensures all field
/// updates are visible before the sequence becomes even ("consistent").
/// The CAS in step 2 provides spinlock semantics (mutual exclusion among writers).
/// The even→odd→even transitions provide seqlock semantics for readers. No
/// separate spinlock field is needed.
/// Reader: read sequence (Acquire) → read fields → re-read sequence (Acquire);
/// retry if mismatch/odd.
///
/// **Writer serialization**: The CAS-based seqlock writer protocol (step 2 above)
/// inherently provides mutual exclusion — only one writer can succeed at the
/// CAS(even → odd) transition. No separate per-entry spinlock is needed. This is
/// equivalent to Linux's `write_seqlock()` which combines the spinlock acquisition
/// and sequence increment into a single atomic operation. The CAS uses Acquire
/// ordering to prevent subsequent field writes from being reordered before the
/// sequence becomes odd. The final store uses Release ordering to ensure all field
/// updates are visible before the sequence becomes even.
///
/// Remote nodes do NOT read directory entries via one-sided RDMA Read — a seqlock
/// cannot work across separate RDMA operations (no atomicity between reads).
/// Instead, remote nodes send a two-sided directory lookup request to the home
/// node (see Section 47.5.4 page fault flow), and the home node's CPU reads the
/// entry locally (where the seqlock works correctly) and returns it.
pub sequence: AtomicU64,
/// Current owner of this page.
pub owner: NodeId,
/// Nodes that have read-only copies.
/// Bitfield: bit N = node N has a copy.
/// Supports up to MAX_CLUSTER_NODES (64) nodes; a u64 bitmask enables
/// efficient set operations (add/remove reader in O(1) via bit set/clear).
pub readers: u64,
/// Page state on the owner node.
pub state: DsmPageState,
/// Version counter (incremented on every ownership transfer).
/// Used for consistency checks and stale-copy detection.
pub version: u64,
/// Membership epoch when this entry's home assignment was last computed.
/// When cluster membership changes (node join/leave), the modular hash
/// assignment is recomputed and directory entries are reassigned to new home nodes.
/// This field records the epoch of the last rebalance so that stale entries
/// from a previous epoch can be detected and re-homed.
pub rehash_epoch: u32,
/// Wait queue for blocking on `Invalidating` → `Exclusive` transitions.
/// Request handlers that encounter `state == Invalidating` block here instead
/// of spin-waiting. The wait queue is woken when the invalidation completes.
/// This field is NOT transmitted over RDMA — it is local to the home node's
/// kernel memory and does not affect the wire format of directory entries.
/// Size: ~16-32 bytes depending on architecture (typically a spinlock + list head).
pub wait_queue: WaitQueueHead,
}
Note: The `u64` bitfield limits clusters to `MAX_CLUSTER_NODES` (64) nodes, matching the design constant defined in Section 47.2.4. This limit is chosen because a u64 bitmask enables O(1) reader set operations (add, remove, membership test via bit manipulation) and fits in a single atomic operation. For larger clusters, extend to `[u64; N]` or use a compact set representation. The 64-node limit covers the target deployment (rack-scale RDMA clusters). Extension is deferred (see Section 47.14.6).

RDMA pool constraint: DSM pages are allocated from the RDMA-registered memory pool (Section 47.3.3), which defaults to 25% of physical RAM (configurable via `/sys/kernel/isle/cluster/rdma_pool_percent`, range 1-75%). Only pages within the RDMA pool can be transferred to or fetched from remote nodes. If the pool is exhausted, remote page faults block until pages are freed via LRU-based eviction (the DSM eviction policy reclaims the least-recently-accessed remote-resident pages first). The 25% default is configurable per deployment; workloads with large distributed working sets should increase the pool size accordingly.
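The CAS-based seqlock writer protocol from the `DsmDirectoryEntry` doc comment can be exercised on a single node. A sketch in safe Rust: the entry fields are stored as atomics accessed with Relaxed ordering so the example compiles without `unsafe`; all cross-field ordering is carried by the `sequence` field, as in the protocol above. `Entry`, `write_entry`, and `read_entry` are illustrative names.

```rust
// Single-node sketch of the CAS-based seqlock protocol on the sequence field.
use std::sync::atomic::{AtomicU64, Ordering};

struct Entry {
    sequence: AtomicU64, // even = consistent/unlocked, odd = write in progress
    owner: AtomicU64,    // stand-in for the NodeId field
    version: AtomicU64,
}

/// Writer: spin for even, CAS(even -> odd, Acquire), mutate, store even+2 (Release).
fn write_entry(e: &Entry, new_owner: u64) {
    loop {
        let seq = e.sequence.load(Ordering::Relaxed);
        if seq % 2 != 0 {
            std::hint::spin_loop(); // step 1: another writer holds the entry
            continue;
        }
        // Step 2: only one writer can win this even -> odd transition.
        if e.sequence
            .compare_exchange(seq, seq + 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            e.owner.store(new_owner, Ordering::Relaxed); // step 3: modify fields
            e.version.fetch_add(1, Ordering::Relaxed);
            e.sequence.store(seq + 2, Ordering::Release); // step 4: publish
            return;
        }
    }
}

/// Reader: retry until an even, unchanged sequence brackets the field reads.
fn read_entry(e: &Entry) -> (u64, u64) {
    loop {
        let s1 = e.sequence.load(Ordering::Acquire);
        if s1 % 2 != 0 {
            std::hint::spin_loop();
            continue; // write in progress
        }
        let snapshot = (e.owner.load(Ordering::Relaxed), e.version.load(Ordering::Relaxed));
        if e.sequence.load(Ordering::Acquire) == s1 {
            return snapshot; // consistent
        }
    }
}

fn main() {
    let e = Entry {
        sequence: AtomicU64::new(0),
        owner: AtomicU64::new(0),
        version: AtomicU64::new(0),
    };
    write_entry(&e, 5);
    write_entry(&e, 9);
    assert_eq!(read_entry(&e), (9, 2));
    assert_eq!(e.sequence.load(Ordering::Relaxed), 4); // two writes: 0 -> 2 -> 4
    println!("seqlock ok");
}
```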
47.5.3 Home Node Directory
Each shared page has a home node determined by hashing its virtual address. The home node stores the authoritative directory entry (who owns the page, who has copies). This avoids a centralized directory server.
Directory indexing data structure: The home node stores directory entries in a
per-region radix tree indexed by virtual page number within the region. The radix tree
uses 9-bit fan-out (512 entries per node, matching page table structure) with
RCU-protected reads and per-node spinlocks for modification. For a 1TB region with
4KB pages (268M entries), the radix tree uses approximately 4 levels with ~524K internal
nodes (~32MB of metadata). Mutual exclusion for directory entry writes is provided by
the CAS-based seqlock writer protocol defined on DsmDirectoryEntry::sequence: the
CAS(even → odd) step provides spinlock semantics (only one writer can succeed), and the
even→odd→even transitions provide seqlock semantics for readers. This eliminates the
need for a separate per-entry spinlock — the sequence field's CAS-based writer
protocol inherently serializes writers (see the writer-serialization note on
DsmDirectoryEntry in Section 47.5.2).
Page with virtual address VA:
home_node = hash(dsm_region_id, VA) % cluster_size
The home node might not be the owner or have a copy.
It just maintains the directory entry.
Why hash-based:
- No single point of failure (every node is home for some pages)
- O(1) lookup (no traversal)
- Uniform distribution across nodes (modular hashing)
- If home node fails: rehash to backup (Section 47.11)
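A minimal sketch of the home-node computation, assuming `DefaultHasher` as a stand-in for whatever hash function ISLE actually uses (this section does not specify one):

```rust
// Minimal sketch of home_node = hash(dsm_region_id, VA) % cluster_size.
// DefaultHasher is an assumption, not the ISLE hash function.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn home_node(dsm_region_id: u64, va: u64, cluster_size: u64) -> u64 {
    let mut h = DefaultHasher::new();
    // Hash the 4KB page number, not the byte offset, so all addresses in a
    // page agree on the same home node.
    (dsm_region_id, va >> 12).hash(&mut h);
    h.finish() % cluster_size // modular hashing, NOT consistent hashing
}

fn main() {
    // All byte offsets within one page map to the same home node.
    assert_eq!(home_node(1, 0x1000, 8), home_node(1, 0x1fff, 8));
    // The result is always a valid node index.
    assert!(home_node(1, 0x1000, 8) < 8);
    println!("home node for region 1, page 0x1000: {}", home_node(1, 0x1000, 8));
}
```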
Note: This is modular hashing (hash % cluster_size), NOT consistent hashing.
Modular hashing remaps most entries when cluster_size changes. ISLE targets
fixed-membership clusters where node join/leave is a rare, coordinated event.
When membership changes, directory rehash uses incremental migration:
1. Quorum leader announces new cluster_size to all nodes.
2. Each home node computes which of its directory entries map to a
different home node under the new hash. These entries are marked
"migrating" but remain readable at the old location.
3. New entries are placed in the new hash location immediately.
4. Existing entries are migrated lazily on access: when a node receives
a directory lookup for an entry it no longer owns (under the new hash),
it forwards the request to the new home node. If the new home node
does not yet have the entry, the old home node transfers it on demand.
5. A background sweep migrates remaining entries at low priority.
6. The directory maintains both old and new hash functions during the
migration window. No stop-the-world pause is required. The migration
duration depends on cluster size and DSM dataset size: for typical
clusters (8-32 nodes, <100GB DSM), the background sweep completes in
~1-5 seconds. For large DSM datasets (e.g., 1TB, ~256M directory entries),
migration of all entries takes longer — potentially minutes if entries are
not accessed during the sweep (each migration requires an RDMA round-trip
of ~3μs). The system remains fully functional during this window via the
dual-hash lookup (old location serves requests until migration completes).
7. After all entries are migrated, the old hash function is retired.
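The dual-hash lookup window (steps 4-6) can be sketched as follows, with `va % size` standing in for `hash(region, VA) % cluster_size` and a set of already-migrated entries standing in for the per-entry migration state. Names are illustrative.

```rust
// Sketch of the dual-hash lookup window during incremental directory rehash.
use std::collections::HashSet;

fn old_home(va: u64, old_cluster_size: u64) -> u64 {
    va % old_cluster_size // stand-in for hash % old size
}

fn new_home(va: u64, new_cluster_size: u64) -> u64 {
    va % new_cluster_size // stand-in for hash % new size
}

/// During the migration window an entry is served at its old home until it
/// has actually moved (step 4); afterwards the old home redirects to the new.
fn serving_node(va: u64, old_size: u64, new_size: u64, migrated: &HashSet<u64>) -> u64 {
    if migrated.contains(&va) {
        new_home(va, new_size)
    } else {
        old_home(va, old_size)
    }
}

fn main() {
    let mut migrated = HashSet::new();
    // Before migration: served at the old home (10 % 4 = node 2).
    assert_eq!(serving_node(10, 4, 5, &migrated), 2);
    // After the entry migrates: served at the new home (10 % 5 = node 0).
    migrated.insert(10);
    assert_eq!(serving_node(10, 4, 5, &migrated), 0);
    println!("dual-hash lookup ok");
}
```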
In-flight page faults during rehash: A page fault that arrives at a node which
is no longer the home node (under the new hash) is handled by forwarding:
- The old home node checks if it still has the directory entry locally.
If yes, it services the request directly (the entry hasn't migrated yet).
- If the entry has already migrated, the old home node returns a REDIRECT
response with the new home node ID. The faulting node retries the lookup
at the new home node. At most one redirect per fault in the common case
(the new hash is deterministic, so the second lookup always reaches the
correct node). During active entry migration, the fault handler may spin
for up to 8 retries before the redirect (see write-fault race below),
bounding worst-case latency to ~8μs + one redirect.
- Ownership transfers in progress (Section 47.5.4 steps 3-8) that span the
rehash boundary complete under the old hash. The entry is migrated to the
new home node after the transfer completes. This is safe because the
old home node holds the entry until migration, and the seqlock prevents
concurrent modification.
- **Concurrent rehash**: If a second membership change occurs while a rehash is
in progress, the first rehash is completed before the second begins (the
**quorum leader** — defined as the lowest-ID node in the current majority
partition — serializes membership change announcements). During the first
rehash's 1-5 second migration window, the quorum leader holds a membership-change lock
that prevents new node join/leave from being processed. New membership events
are queued and processed sequentially after the current rehash completes. This
ensures that at most one hash transition is active at any time, preserving the
"at most one redirect" guarantee.
**Failure during rehash**: If a node FAILS during a rehash, the failure
is handled by the existing membership protocol (Section 47.11) — the dead
node's directory entries are redistributed as part of the rehash, not as a
separate event. If the **quorum leader** dies while holding the membership-change
lock, the new quorum leader is deterministically identified as the lowest-ID
surviving node in the majority partition (Section 47.11.3). The new
leader inherits the in-progress rehash by querying all surviving nodes for their
current hash function version (old vs. new). It then either completes the rehash
(if >50% of entries have migrated) or aborts it (rolling back to the old hash
function). The membership-change lock is not a distributed lock — it is a logical
role held by the quorum leader, so quorum-leader reassignment implicitly transfers it.
Write-fault race during directory entry migration: A write fault on a page
whose directory entry is currently being moved to a new home node (step 4
above — lazy on-demand transfer from old to new home node) requires care:
- Each directory entry carries a per-entry rehash_epoch field (u32) that
is incremented when the entry is tagged "migrating to new home" (step 2).
- The fault handler reads the rehash_epoch before and after reading the
directory entry (as part of the existing seqlock protocol). If the epoch
has changed, or if the entry's state is "migrating", the fault handler
treats this as a transient condition and retries the directory lookup after
a short spin (up to 8 retries with 1 μs backoff, then falls back to the
REDIRECT path).
- Alternatively, the home node may hold a per-entry spinlock during the
entry-transfer step (old home → new home). The fault handler acquiring the
same lock before reading the entry ensures it either sees the entry fully
present (pre-transfer) or receives a REDIRECT (post-transfer), with no
window where the entry is partially visible.
Both mechanisms are equivalent in correctness; the epoch/retry approach is
preferred because it avoids blocking the fault path on a lock acquisition.
This is fundamentally different from DLM's consistent hashing (Section 31a.4
in 07-storage.md), which uses a hash ring for minimal redistribution because
lock master reassignment must be fast and non-disruptive.
47.5.4 Page Fault Flow
Process on Node A reads address VA (not present locally):
1. CPU page fault on Node A.
2. ISLE MM handler identifies VA as part of a DSM region.
3. Compute home_node = hash(region, VA) % cluster_size = Node C.
4. RDMA Send to home Node C: "Lookup directory entry for VA."
(Two-sided — the home node's CPU reads the entry locally using the seqlock
and returns it. One-sided RDMA Read cannot be used because the DsmDirectoryEntry
is larger than 8 bytes and a seqlock requires atomic re-reads, which separate
RDMA Read operations cannot provide. Round-trip: ~4-5 μs.)
Node C's request handler reads the directory entry under the seqlock. If the
entry's state is a transient state, the handler resolves it locally before
responding:
- **state == Invalidating**: A concurrent write-fault is invalidating readers.
The handler **blocks on a per-entry wait queue** (see "Invalidating state
blocking mechanism" below) until the state transitions out of Invalidating
(typically to Exclusive, once all invalidation acks arrive). The requesting
node (A) never sees the Invalidating state — the home node resolves it before
replying. The handler does NOT spin-wait; spinning for up to 1000ms (the
membership dead timeout) would block the CPU and could cause priority inversion
or deadlock in the request handler thread pool.
- **state == Migrating**: The page's home assignment is being transferred to a
different node due to a membership change. The handler returns EAGAIN. Node A
retries the directory lookup after a brief backoff (10-50μs), by which time
the migration has typically completed and the new home node is authoritative.
5. Directory says: owner = Node B, state = Exclusive (or SharedOwner).
6. RDMA Send to Node B: "Request read copy of page at VA."
(Two-sided, Node B's kernel handles the request.)
7. Node B's handler:
a. Transitions page state from Exclusive → SharedOwner.
b. RDMA Write: sends 4KB page to Node A.
c. RDMA Send: Node B sends directory update request to Node C (add Node A
to readers). Node C's CPU processes the request locally — updating the
`readers` bitfield, `state`, and `version` fields atomically using the
local seqlock. (~4-5 μs round-trip.)
8. Node A receives page:
a. Installs page in local page table (read-only mapping).
b. Sets local DsmPageState = SharedReader.
c. Resumes faulting process.
Total latency: ~10-18 μs (directory lookup ~4-5 μs + ownership request ~3-5 μs
+ page transfer ~3-5 μs + local install ~1 μs)
Compare: NVMe page fault = ~12-15 μs (comparable)
Process on Node A writes to address VA (has read-only copy):
1. CPU page fault (write to read-only page) on Node A.
2. ISLE MM handler identifies DSM page with SharedReader state.
3. Request ownership transfer:
a. RDMA Send to home Node C: "Request exclusive ownership of VA."
b. Node C looks up directory: owner = Node B, readers = {A, D}.
Node C transitions the directory entry to `Invalidating` state (setting
`owner = Node A` (the requester), `state = DsmPageState::Invalidating`)
under the seqlock before sending any invalidations. This ensures that any
concurrent directory lookup during the invalidation window sees the
`Invalidating` state and waits (see read-fault flow step 4).
c. Node C sends invalidation to all readers except requester:
- RDMA Send to Node D: "Invalidate your copy of VA."
- Node D unmaps page, flushes TLB, sends ack to Node C.
d. After all sharer acks received:
**Case 1: home != owner (C != B) — normal forwarding:**
- Node C sends ownership transfer request to Node B:
"Transfer ownership of VA to Node A."
- Node B unmaps page, sends page data directly to Node A via RDMA Write.
- Node B sends ack to Node C confirming transfer complete.
- Node B transitions to NotPresent.
**Case 2: home == owner (C == B) — skip forwarding:**
- The home node IS the current owner. No forwarding message is needed.
The home node directly invalidates its own local copy, prepares the
page data, and sends it to Node A via RDMA Write. This avoids
self-messaging (which would deadlock or cause infinite forwarding
if the home node's request handler sends a message to itself).
- Node C (the home/owner) unmaps its own page, transitions to NotPresent.
- **Concurrency model (applies to both Case 1 and Case 2)**: The directory
entry is transitioned to `Invalidating` at step 3b (before any invalidation
messages are sent), using the CAS-based seqlock writer protocol for mutual
exclusion. Other request handlers that read the directory entry during the
invalidation window see the `Invalidating` state and **block on the
per-entry wait queue** (see "Invalidating state blocking mechanism" below)
until the state transitions to `Exclusive` (indicating all invalidation
acks have been received and the transfer is complete). This prevents
stale-ownership reads: no node can observe the old owner in the directory
entry while invalidations are in flight. Invalidation requests to readers
are sent after the seqlock write section completes (sequence returns to
even), allowing concurrent directory reads to see the `Invalidating` state.
After all acks arrive, the handler re-acquires the seqlock, transitions
the state from `Invalidating` to `Exclusive` at step 3f, and **wakes all
waiters** on the per-entry wait queue. This prevents deadlock because
other directory operations on the same home node can proceed while waiting
for invalidation acks (waiters are descheduled, not spinning). Additionally,
invalidation ack processing on a reader node does NOT require any directory
operation — it only involves local page table manipulation and an RDMA
Send reply. Therefore, invalidation acks cannot create circular lock
dependencies.
e. Node A confirms receipt of page data by sending ack to Node C.
f. Node C updates directory: owner = Node A, readers = {}, state = Exclusive.
g. Node C sends grant to Node A (RDMA Send): "You now own VA exclusively."
4. Node A receives exclusive ownership:
a. Installs page with read-write mapping.
b. Resumes faulting process.
Total latency: ~15-25 μs (involves invalidating remote copies)
Home==owner optimization saves one RDMA round-trip (~3-5 μs) in that case.
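The page-state transitions exercised by the two fault flows above can be captured as a small legality check. This is an illustrative subset of the protocol, not an exhaustive transition table; `is_legal_transition` is a hypothetical helper.

```rust
// Illustrative legality check for the state transitions named in the
// read-fault and write-fault flows above.

#[derive(Clone, Copy, PartialEq, Debug)]
enum DsmPageState {
    Exclusive,
    SharedOwner,
    SharedReader,
    NotPresent,
    Migrating,
    Invalidating,
}

fn is_legal_transition(from: DsmPageState, to: DsmPageState) -> bool {
    use DsmPageState::*;
    matches!(
        (from, to),
        (Exclusive, SharedOwner)          // read fault 7a: remote reader fetched a copy
            | (NotPresent, SharedReader)  // read fault 8b: installed read-only copy
            | (SharedReader, Exclusive)   // write fault 4: ownership acquired
            | (SharedReader, NotPresent)  // write fault 3c: copy invalidated
            | (SharedOwner, NotPresent)   // write fault 3d: ownership transferred away
            | (SharedOwner, Invalidating) // directory, 3b: invalidation window opens
            | (Invalidating, Exclusive)   // directory, 3f: all acks received
    )
}

fn main() {
    use DsmPageState::*;
    assert!(is_legal_transition(Exclusive, SharedOwner));
    assert!(is_legal_transition(Invalidating, Exclusive));
    assert!(!is_legal_transition(NotPresent, SharedOwner));
    println!("transition check ok");
}
```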
Invalidation ack timeout with escalation: Each invalidation request sent by the home node in step 3c carries a 200 μs timeout. The home node does NOT proceed with the ownership transfer until all readers have acknowledged invalidation or been removed from the reader set — doing so would violate coherence (the stale reader could read data that the new exclusive owner has since modified). If a reader does not acknowledge within 200 μs, the home node escalates:
- Retry (up to 3 attempts, 200 μs apart): The home node re-sends the invalidation request. A live but slow reader (e.g., handling a long interrupt or scheduling delay) will respond on retry.
- Suspect (after 600 μs total): The home node reports the non-responding reader to the membership protocol (Section 47.11) as suspect. If the node is genuinely unreachable, the membership protocol will mark it Suspect after 3 missed heartbeats (300 ms) and Dead after 10 missed heartbeats (1000 ms, per Section 47.11.2), at which point its reader bit is cleared from all directory entries.
- Proceed after removal: Once the reader has either acknowledged the invalidation or been marked Dead by the membership protocol (at which point its reader bit is cleared from the directory entry), the ownership transfer proceeds. The write-faulting thread blocks during escalation but is guaranteed forward progress — either the reader responds or the membership protocol eventually marks it Dead and removes it.
This ensures strict coherence: no node can hold a read-only mapping while another node holds exclusive ownership. The worst-case latency for a write fault with an unresponsive reader is ~1000 ms (the membership dead timeout: 10 missed heartbeats at 100 ms intervals, per Section 47.11.2), compared to Linux's 10-30 second fencing delay. In the common case (all readers responsive), the additional cost is zero — the 200 μs timeout never fires.
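The escalation ladder above can be sketched as a simple retry loop. This is an illustrative, synchronous model (the names escalate_invalidation and AckOutcome, and the callback shape, are hypothetical); the real path would be asynchronous and integrated with the membership protocol rather than blocking in a loop.

```rust
// Illustrative model of the invalidation-ack escalation ladder.
// All names are hypothetical; the kernel path is event-driven.
#[derive(Debug, PartialEq)]
pub enum AckOutcome {
    /// Reader acknowledged on attempt `attempts` (1-based).
    Acked { attempts: u32 },
    /// Reader never responded; it was reported to the membership protocol,
    /// eventually marked Dead (~1000 ms), and its reader bit was cleared.
    ReaderRemoved,
}

pub const INVAL_TIMEOUT_US: u64 = 200;
pub const MAX_RETRIES: u32 = 3;

/// `send_and_wait(attempt)` models one invalidation send plus a 200 us wait;
/// it returns true if the reader acked within the timeout window.
pub fn escalate_invalidation<F>(mut send_and_wait: F) -> AckOutcome
where
    F: FnMut(u32) -> bool,
{
    // Up to 3 attempts, 200 us apart (600 us total before escalation).
    for attempt in 1..=MAX_RETRIES {
        if send_and_wait(attempt) {
            return AckOutcome::Acked { attempts: attempt };
        }
    }
    // After 600 us: report the reader as suspect to the membership protocol.
    // The transfer proceeds only once the reader is marked Dead and removed
    // from the sharer set.
    AckOutcome::ReaderRemoved
}
```

A reader that responds only on the second attempt yields Acked { attempts: 2 }; one that never responds falls through to the membership path.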
Maximum Invalidating state duration: A directory entry remains in the Invalidating
state from step 3b (when the home node sets it) until step 3f (when all invalidation
acks arrive and the state transitions to Exclusive). The maximum duration is bounded
by the invalidation ack timeout escalation above: 200 μs initial timeout, up to 3
retries (600 μs), then escalation to the membership protocol which marks an
unresponsive node Dead after 1000 ms (10 missed heartbeats at 100 ms intervals, per
Section 47.11.2). At that point, the home node removes the unresponsive node from the
sharer set and proceeds with the state transition. Therefore, the worst-case
Invalidating duration is bounded by the membership dead timeout (~1000 ms). Any
concurrent read-fault that arrives at the home node during this window will block
on the per-entry wait queue (see read-fault flow step 4) until the state resolves.
The waiters are descheduled during this time, not spinning, allowing the CPU to
process other requests.
Invalidating state blocking mechanism: Each DsmDirectoryEntry includes a
per-entry spinlock (entry_lock: SpinLock) and a kernel wait queue
(wait_queue: WaitQueueHead) for blocking on state transitions. The seqlock
(sequence field) remains the fast-path mechanism for read-side directory lookups,
but the blocking/wakeup path uses a separate spinlock because seqlocks cannot be
held across schedule() (the seqlock write side disables preemption internally,
and sleeping with preemption disabled is a deadlock).
When a request handler encounters the Invalidating state:
1. The handler acquires entry_lock (the per-entry spinlock).
2. The handler re-checks the state under the spinlock. If it is no longer Invalidating, the handler releases the spinlock and proceeds (avoiding an unnecessary sleep).
3. The handler calls prepare_to_wait(&entry.wait_queue, TASK_INTERRUPTIBLE) — this adds the handler to the wait queue while holding the spinlock, preventing the race where a waker calls wake_up_all() between the lock release and the wait-queue addition.
4. The handler releases entry_lock.
5. The handler calls schedule(), which deschedules the thread. If a concurrent wakeup occurred between steps 3 and 5, schedule() returns immediately without sleeping (the standard Linux wait pattern).
6. On wakeup, the handler calls finish_wait(&entry.wait_queue) and re-reads the directory entry under the seqlock (optimistic read path) to proceed.
When the ownership transfer completes (step 3f), the invalidating handler:
- Acquires the seqlock writer (CAS sequence even→odd, serializing writers).
- Transitions state from Invalidating to Exclusive.
- Releases the seqlock writer (writes the sequence back to even).
- Acquires entry_lock, calls wake_up_all(&entry.wait_queue), releases entry_lock.
Wait queue safety invariant: The wait queue is protected by entry_lock (a
per-entry spinlock), NOT by the seqlock. Waiters add themselves to the queue under
entry_lock (step 3), and wakers hold entry_lock while calling wake_up_all().
The seqlock writer and entry_lock are independent — the seqlock serializes
directory entry updates (state transitions), while entry_lock serializes wait
queue operations. Lock ordering: seqlock writer THEN entry_lock (never reversed).
This blocking mechanism ensures that the home node can process thousands of
concurrent DSM requests without CPU-starving due to spin-waits. The wait queue
is per-entry (not global), so blocking on one page's invalidation does not
affect other pages. The wait queue head is allocated lazily — the
DsmDirectoryEntry stores only a pointer (wait_queue: *mut WaitQueueHead,
8 bytes) that is null until the first blocking wait occurs. Most entries never
experience contention, so the common case pays only 8 bytes of pointer overhead
per entry rather than embedding a full 16-32 byte wait queue head in every entry.
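The wait/wake protocol above can be modeled in userspace with a mutex and condition variable standing in for entry_lock and the wait queue. This is a sketch of the synchronization shape only (Condvar::wait atomically releases the lock, playing the role of the prepare_to_wait + schedule pairing, so there is no lost-wakeup window); all names are illustrative.

```rust
use std::sync::{Condvar, Mutex};

// Userspace model of the per-entry blocking protocol: the Mutex stands in
// for entry_lock and the Condvar for the wait queue. Illustrative only.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DirState {
    Invalidating,
    Exclusive,
}

pub struct EntryModel {
    state: Mutex<DirState>,
    wait_queue: Condvar,
}

impl EntryModel {
    pub fn new(state: DirState) -> Self {
        EntryModel { state: Mutex::new(state), wait_queue: Condvar::new() }
    }

    /// Handler side: re-check under the lock, then sleep until the state
    /// leaves Invalidating. Condvar::wait releases the lock atomically,
    /// mirroring prepare_to_wait + schedule; the loop handles spurious wakeups.
    pub fn wait_until_resolved(&self) -> DirState {
        let mut st = self.state.lock().unwrap();
        while *st == DirState::Invalidating {
            st = self.wait_queue.wait(st).unwrap();
        }
        *st
    }

    /// Invalidating-handler side: transition the state, then wake all waiters
    /// (mirrors the seqlock write followed by wake_up_all under entry_lock).
    pub fn resolve(&self, new_state: DirState) {
        *self.state.lock().unwrap() = new_state;
        self.wait_queue.notify_all();
    }
}
```

If resolve() runs before a waiter blocks, the waiter's re-check under the lock observes Exclusive and returns without sleeping, matching step 2 of the protocol.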
RDMA operation ordering: Multi-step DSM operations (send page data, then update
directory) require explicit acknowledgment to guarantee ordering across different QPs.
IBV_SEND_FENCE only orders operations within a single QP, but page data goes to the
page recipient while the directory update goes to the home node — these are different
QPs on different remote nodes. The correct protocol uses an explicit ack step:
1. Owner sends page data to requester via RDMA Write (on QP to requester).
2. Requester confirms receipt by sending an RDMA Send ack to the home node.
3. Home node updates the directory via local CAS (no cross-QP ordering needed since
the home node's CPU processes the ack before updating its own local directory).
This adds one extra round-trip (~2 μs) compared to a naive fence-based approach, but
is required for correctness because RDMA fencing cannot provide cross-QP ordering.
Security requirement — Invalidation ACK authentication: Invalidation ACKs
MUST be authenticated to prevent spoofing by malicious nodes. Without authentication,
a malicious node could send forged invalidation ACKs to the home node, causing it to
prematurely transition the directory entry to Exclusive while a stale reader still
holds a mapping — violating cache coherence and potentially leaking data. Authentication
requirements:
- Each invalidation request from the home node (step 3c) MUST include a cryptographically random 128-bit invalidation_nonce generated fresh for each invalidation batch.
- The invalidation ACK from each reader node MUST include:
  - The same invalidation_nonce (proving the ACK is a response to this specific invalidation request, not a replay)
  - An HMAC-SHA-256 computed over {nonce || page_va || reader_node_id} using a session key established during cluster join (Section 47.2.3, derived via X25519 Diffie-Hellman and HKDF-SHA256 from the shared secret)
- The home node MUST verify the HMAC before accepting the ACK. Invalid or missing HMACs MUST be treated as if the ACK was never received (triggering timeout and escalation to the membership protocol).
- To limit overhead, the session key is established once during cluster join and rotated every 24 hours. Key rotation uses the authenticated control channel (Section 47.2.3).
- ACKs received after the directory state has already transitioned to Exclusive (duplicate ACKs due to network reordering) MUST be silently discarded without error — the nonce lookup will fail since the invalidation batch is complete.
Performance impact: HMAC-SHA-256 verification adds ~500 ns to ACK processing on the home node. This is negligible compared to the ~200 μs timeout window. The nonce prevents replay attacks without requiring per-ACK signatures.
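The home node's ACK bookkeeping (nonce lookup, duplicate discard, invalid-MAC handling) can be sketched as follows. The MAC check is abstracted into a boolean here; a real implementation would verify the HMAC-SHA-256 described above before this point. All type and function names are hypothetical.

```rust
use std::collections::HashMap;

// Model of home-node invalidation-ACK processing. Illustrative names only;
// `mac_ok` stands in for HMAC-SHA-256 verification over
// {nonce || page_va || reader_node_id} with the per-session key.
#[derive(Debug, PartialEq)]
pub enum AckResult {
    /// Valid ACK: reader removed from the pending set.
    Accepted,
    /// Nonce unknown: batch already complete (duplicate ACK), silently discard.
    Discarded,
    /// Bad or missing MAC: treat as if never received (timeout will fire).
    Ignored,
}

/// Pending invalidation batches, keyed by the 128-bit nonce.
/// Value: the set of reader node ids still expected to ACK.
pub struct PendingInvalidations {
    batches: HashMap<u128, Vec<u16>>,
}

impl PendingInvalidations {
    pub fn new() -> Self {
        PendingInvalidations { batches: HashMap::new() }
    }

    pub fn start_batch(&mut self, nonce: u128, readers: Vec<u16>) {
        self.batches.insert(nonce, readers);
    }

    pub fn process_ack(&mut self, nonce: u128, reader: u16, mac_ok: bool) -> AckResult {
        if !mac_ok {
            return AckResult::Ignored;
        }
        let readers = match self.batches.get_mut(&nonce) {
            Some(r) => r,
            // Nonce lookup fails once the batch is complete: duplicate ACK.
            None => return AckResult::Discarded,
        };
        readers.retain(|&r| r != reader);
        if readers.is_empty() {
            // Last ACK: batch done; the state may transition to Exclusive.
            self.batches.remove(&nonce);
        }
        AckResult::Accepted
    }
}
```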
47.5.5 Extending PageLocationTracker
The PageLocation enum (Section 43.1.5) includes RemoteNode, RemoteDevice, and
CxlPool variants for distributed memory tracking. The following shows the distributed
variants and their semantics:
// Distributed variants of PageLocation (defined canonically in Section 43.1.5)
pub enum PageLocation {
// ... existing variants (CpuNode, DeviceLocal, Migrating, etc.) ...
/// Page is in CPU memory on this NUMA node (existing).
CpuNode(u8),
/// Page is in accelerator device-local memory (existing).
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is being transferred (migration in progress).
/// Consistent with DsmPageState::Migrating = 4 (the canonical definition above).
/// **Canonical definition**: Section 43.1.5 in 11-accelerators.md defines the
/// Migrating variant with a side-table index (`migration_id: u32`) to keep
/// `PageLocation` at 24 bytes. The `MigrationRecord` side table (defined in
/// Section 43.1.5, stored in `PageLocationTracker::active_migrations`) holds
/// the full source/target details (source_kind, source_node, source_device,
/// source_addr, target_kind, target_node, target_device, target_addr).
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (existing).
NotPresent,
/// Page is in compressed pool (existing).
Compressed,
/// Page is in swap (existing).
Swapped,
// === New: distributed memory locations ===
/// Page is on a remote node's CPU memory, accessible via RDMA.
RemoteNode {
node_id: NodeId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote node's accelerator memory (GPUDirect RDMA).
RemoteDevice {
node_id: NodeId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
Security requirement — CXL pool bounds validation: The kernel MUST validate
pool_id and pool_offset before using them to access memory. An out-of-bounds
pool_id or pool_offset could allow unauthorized access to memory outside the
intended pool, potentially exposing kernel data or allowing privilege escalation.
Validation requirements:
- pool_id MUST be validated against the global CXL pool registry (cxl_pool_count). Accessing a non-existent pool MUST fail with EINVAL.
- pool_offset MUST be validated against the target pool's size (cxl_pools[pool_id].size_bytes). Accesses beyond the pool boundary MUST fail with EFAULT.
- For RDMA-initiated CXL pool access, the remote node's capability (Section 47.9) MUST authorize the specific pool_id. A capability granting access to pool 0 MUST NOT be usable to access pool 1.
- CXL pool resize is grow-only — pools can be expanded but never shrunk. This eliminates the TOCTOU race where a pool shrinks between the bounds check and the CPU load/store instruction (seqlocks cannot prevent this race because CPU memory accesses are not rollback-capable). The pool's size_bytes is read with Acquire ordering; a concurrent grow only increases the valid range, so a stale (smaller) size produces a conservative bounds check, never an out-of-bounds access. Pool deallocation (destroying a pool entirely) requires quiescing all accessors first via the standard RCU grace period mechanism.
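A minimal sketch of the bounds checks, under the grow-only assumption stated above (types and the helper name are illustrative). Errors mirror the spec: EINVAL for a non-existent pool, EFAULT for an access beyond the pool boundary.

```rust
// Sketch of CXL pool bounds validation. Hypothetical types/names.
#[derive(Debug, PartialEq)]
pub enum CxlError {
    Einval, // non-existent pool_id
    Efault, // offset/length beyond the pool boundary
}

pub struct CxlPool {
    /// Grow-only: a stale (smaller) read of size_bytes is always conservative.
    pub size_bytes: u64,
}

pub fn validate_cxl_access(
    pools: &[CxlPool],
    pool_id: u32,
    pool_offset: u64,
    len: u64,
) -> Result<(), CxlError> {
    // pool_id must name an existing pool in the global registry.
    let pool = pools.get(pool_id as usize).ok_or(CxlError::Einval)?;
    // The access [pool_offset, pool_offset + len) must lie inside the pool;
    // checked_add rejects offsets where offset + len would wrap around u64.
    let end = pool_offset.checked_add(len).ok_or(CxlError::Efault)?;
    if end > pool.size_bytes {
        return Err(CxlError::Efault);
    }
    Ok(())
}
```

The overflow check matters: without checked_add, a crafted pool_offset near u64::MAX could wrap past the size comparison.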
47.5.6 DSM Region Management
Distributed shared memory is opt-in. Processes create DSM regions explicitly:
/// Create a distributed shared memory region.
/// All nodes participating in this region can map it.
pub struct DsmRegionCreate {
/// Unique region identifier (cluster-wide).
pub region_id: u64,
/// Virtual address base for this region (must be page-aligned).
/// This is the starting virtual address at which the region will be mapped
/// on all participating nodes. Must not overlap existing mappings.
pub base_addr: u64,
/// Size of the shared region (bytes, page-aligned).
pub size: u64,
/// Page size for this region.
pub page_size: DsmPageSize,
/// Access permissions for this region (read / write / execute bitfield).
/// DSM_PROT_READ = 0x1, DSM_PROT_WRITE = 0x2, DSM_PROT_EXEC = 0x4.
/// Applied uniformly to all mappings on all participating nodes.
pub permissions: u32,
/// Consistency model.
pub consistency: DsmConsistency,
/// Initial placement: which node holds the pages initially.
pub initial_owner: NodeId,
/// Home node assignment policy for directory entries.
/// DSM_HOME_HASH = 0: home_node = hash(region_id, VA) % cluster_size (default).
/// DSM_HOME_FIXED = 1: all directory entries homed on initial_owner (simple, but
/// creates a single hot-spot; use only for small, rarely-accessed regions).
pub home_policy: u32,
/// Access control: which nodes can join this region.
pub allowed_nodes: u64, // Bitfield, limited to MAX_CLUSTER_NODES (64)
/// Capability required to map this region.
pub required_cap: CapHandle,
/// Behavior flags (bitfield).
/// DSM_EAGER_WRITEBACK = 0x1: flush dirty pages to owner on lock release
/// (Section 47.5.8). Default: lazy writeback on eviction only.
/// DSM_REPLICATE = 0x2: enable directory replication for fault tolerance
/// (Section 47.11.4). Default: single home node per entry.
pub flags: u32,
/// Reserved for future extensions; must be zero.
pub _pad: [u8; 4],
}
#[repr(u32)]
pub enum DsmPageSize {
/// Standard 4KB pages. Best for random access patterns.
Page4K = 0,
/// 2MB huge pages. Best for sequential / bulk access.
/// Reduces TLB misses but increases transfer granularity.
HugePage2M = 1,
}
#[repr(u32)]
pub enum DsmConsistency {
/// Release consistency: writes become visible to other nodes
/// after an explicit release (memory barrier / unlock).
/// Best performance. Standard for most HPC/ML workloads.
Release = 0,
/// Sequential consistency is NOT supported. Providing a total order over all
/// memory operations across nodes would require serializing every write through
/// a single ordering point (or using a distributed total-order broadcast), adding
/// ~5-15 μs per write operation even on RDMA. This overhead would negate the
/// performance benefits of DSM for virtually all workloads. Applications requiring
/// sequential consistency should use explicit distributed locking (DLM, Section
/// 31a) or message-passing instead of DSM.
///
/// Reserved for potential future use if hardware (e.g., CXL 3.0 with hardware
/// coherence) provides efficient total ordering.
SequentialReserved = 1,
/// Eventual consistency: writes propagate asynchronously.
/// Lowest overhead. Suitable for read-heavy, stale-tolerant data.
Eventual = 2,
}
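A sketch of the parameter validation a region-create path might perform on the fields above. The constants follow the struct's documentation; the specific error strings and the helper itself are illustrative, not the kernel's actual interface.

```rust
// Illustrative validation of DsmRegionCreate parameters (field subset only).
pub const DSM_PROT_READ: u32 = 0x1;
pub const DSM_PROT_WRITE: u32 = 0x2;
pub const DSM_PROT_EXEC: u32 = 0x4;
pub const DSM_EAGER_WRITEBACK: u32 = 0x1;
pub const DSM_REPLICATE: u32 = 0x2;

pub fn validate_region_params(
    base_addr: u64,
    size: u64,
    page_size_bytes: u64, // 4096 for Page4K, 2 MiB for HugePage2M
    permissions: u32,
    flags: u32,
    pad: [u8; 4],
) -> Result<(), &'static str> {
    if size == 0 {
        return Err("size must be non-zero");
    }
    // base_addr and size must be aligned to the region's page size.
    if base_addr % page_size_bytes != 0 || size % page_size_bytes != 0 {
        return Err("base_addr/size not page-aligned");
    }
    // Only the documented permission bits may be set.
    if permissions & !(DSM_PROT_READ | DSM_PROT_WRITE | DSM_PROT_EXEC) != 0 {
        return Err("unknown permission bits");
    }
    // Only the documented behavior flags may be set.
    if flags & !(DSM_EAGER_WRITEBACK | DSM_REPLICATE) != 0 {
        return Err("unknown flag bits");
    }
    // Reserved padding must be zero (allows future extension).
    if pad != [0; 4] {
        return Err("_pad must be zero");
    }
    Ok(())
}
```

Rejecting unknown bits (rather than ignoring them) is what lets _pad and the flags field grow into new meanings later without silently misinterpreting old callers.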
47.5.6.1 DSM Region Destruction Protocol
Destroying a DSM region requires coordinated cleanup across all participating nodes:
DSM region destruction for region R:
1. Initiator (any node with the region's destroy capability) sends
REGION_DESTROY(region_id) to all nodes in allowed_nodes bitmask.
2. Each participating node:
a. Unmaps all local pages belonging to region R from process page tables.
b. Flushes TLB entries for region R's address range.
c. For pages where this node is the owner: marks pages as reclaimable
(returned to the local physical page allocator).
d. For pages where this node is a reader: discards the local copy
(no writeback needed — reader copies are clean by the single-writer
invariant).
e. Sends REGION_DESTROY_ACK(region_id, node_id) to the initiator.
3. Initiator collects ACKs from all participating nodes.
Timeout: 5 seconds. Nodes that do not ACK within timeout are assumed
dead — their pages are abandoned (will be reclaimed when those nodes
eventually rejoin or are declared dead via heartbeat).
4. After all ACKs received (or timeout):
a. Initiator sends REGION_DIRECTORY_CLEANUP(region_id) to all nodes
that serve as home nodes for pages in this region.
b. Each home node removes all DsmDirectoryEntry records for region R
from its directory. The backup home node (Section 47.11.3) is also
notified to remove shadow entries.
c. The region_id is retired and cannot be reused for 24 hours
(prevents stale references from delayed messages).
5. Region metadata is removed from /sys/kernel/isle/cluster/dsm/regions.
Error handling:
- If a process still has region R mapped when destruction is initiated,
the mapping is force-unmapped and the process receives SIGBUS on
subsequent access attempts.
- If the initiator crashes during destruction, any node can resume
the protocol by re-sending REGION_DESTROY (idempotent — nodes that
already completed destruction simply re-ACK).
47.5.7 Linux Compatibility Interface
DSM is exposed via standard POSIX shared memory with extensions:
Standard POSIX (works unmodified):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ftruncate(fd, size);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ On a single node, this is standard shared memory. No DSM.
ISLE-specific extension (opt-in):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ioctl(fd, ISLE_SHM_MAKE_DISTRIBUTED, &dsm_config);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ Same mmap'd pointer, but pages can now migrate across nodes.
→ Process on Node B can shm_open("/my_region") and map the same region.
For MPI applications: MPI implementations (OpenMPI, MPICH) can use DSM
regions for intra-communicator shared memory windows, replacing the
current combination of mmap + RDMA + application-level coherence with
kernel-managed coherence.
47.5.8 False Sharing Mitigation
When multiple nodes write to different offsets within the same page, the page bounces between owners ("false sharing"), degrading performance severely.
Detection: The home node tracks per-page ownership transfer frequency. If a page
has more than N ownership transfers per second (default: 100), it is flagged as
contended. The contention count is exposed via the isle_tp_stable_dsm_contention
tracepoint.
Mitigation 1: Advisory — Log contended pages to a tracepoint, allowing the application to restructure its data layout to avoid cross-node false sharing. This is the lowest-overhead option.
Mitigation 2: Whole-page writeback on release — On a write fault to a contended page under the single-writer model (Section 47.5.2), the owning node sends the entire 4KB page to the home node at the next release point (memory barrier or unlock). This is simpler than TreadMarks-style Twin/Diff because the single-writer invariant guarantees only one node modifies the page at a time — there are no concurrent writers whose changes need merging. The home node distributes the updated page to readers on their next fault. Overhead: one 4KB RDMA Write per release (~1-2 μs). This is acceptable because contention mitigation already implies the page is frequently transferred. Note: Twin/Diff (creating a copy before writing, then diffing to find changed bytes) would only be beneficial under a relaxed multiple-writer consistency model. ISLE's DSM uses single-writer, making whole-page transfer the correct and simpler approach.
Mitigation 3: Sub-page coherence for huge pages — For 2MB huge pages that exhibit contention, the kernel falls back to 4KB sub-page coherence for the contended region. The huge page is logically split into 512 sub-pages, each tracked independently in the DSM directory. The physical huge page is preserved (no TLB cost).
Default: Detection + advisory tracing. Eager writeback on release is opt-in per
DSM region via DsmRegionCreate.flags |= DSM_EAGER_WRITEBACK.
47.5.9 Error Handling
Concurrent ownership requests: When two nodes simultaneously request exclusive ownership of the same page, the home node serializes the requests. The first request is processed normally; the second requester is queued and notified when ownership becomes available.
Stale directory: If a node presents an ownership claim with a version counter that doesn't match the directory, the home node detects the stale state and rejects the operation. The requestor retries by re-fetching the directory entry.
Partial transfer: If a page is sent via RDMA Write but the directory update (RDMA Send to home node) fails (e.g., due to a concurrent modification or lost message), the home node detects the version mismatch on the next access and repairs the directory. The transferred page is either adopted (if the transfer was valid) or discarded. Directory updates use two-sided RDMA Send because the DsmDirectoryEntry is 56 bytes (core fields 48 bytes + 8-byte lazy wait queue pointer; too large for 8-byte RDMA Atomic CAS). The home node's CPU processes the update request locally using the seqlock protocol (Section 47.5.4).
Deadlock: If Node A waits for ownership of a page held by Node B, while Node B
waits for a page held by Node A, a deadlock occurs. DSM ownership deadlocks are
resolved via timeout on ownership requests (default: 10ms). On timeout, the requesting
operation is aborted and returns EAGAIN. The application retries with backoff. This
is a deadlock recovery mechanism (timeout-based), not true deadlock detection
(which requires a wait-for graph). True deadlock detection with a distributed wait-for
graph is implemented in the DLM (Section 31a.9 in 07-storage.md), where lock
dependencies are tracked explicitly. DSM uses the simpler timeout approach because page
ownership requests are transient (microsecond-scale) and building a wait-for graph for
every page fault would add unacceptable overhead to the critical path.
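The timeout-recovery path above (abort with EAGAIN, retry with backoff) can be sketched as follows. The exponential backoff with a cap is one plausible policy, not mandated by the spec; the callback and function names are hypothetical.

```rust
// Model of timeout-based deadlock recovery for DSM ownership requests.
// `request(attempt)` models one ownership request with its 10 ms timeout;
// it returns true on success, false if the request timed out (EAGAIN).
pub fn acquire_with_backoff<F>(
    mut request: F,
    max_retries: u32,
) -> Result<Vec<u64>, &'static str>
where
    F: FnMut(u32) -> bool,
{
    let mut slept_ms = Vec::new(); // backoff delays taken, for illustration
    let mut backoff_ms = 1;
    for attempt in 0..max_retries {
        if request(attempt) {
            return Ok(slept_ms);
        }
        // EAGAIN: back off before retrying (exponential, capped at 64 ms).
        // Randomized jitter would help avoid lockstep retries in practice.
        slept_ms.push(backoff_ms);
        backoff_ms = (backoff_ms * 2).min(64);
    }
    Err("ownership request failed after max retries")
}
```

Because both deadlocked parties time out and back off independently, one of them typically wins the retry race, which is why this recovery scheme suffices for microsecond-scale page ownership without a wait-for graph.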
47.5.10 Honest Performance Expectations
DSM provides a programming convenience (shared address space), NOT transparent performance parity with local memory. Expectations must be calibrated:
What DSM IS good for:
- Read-mostly workloads with occasional writes (replicated data).
- Coarse-grained partitioned workloads where each node mostly touches
its own partition, with infrequent cross-node access.
- Replacing explicit message-passing in applications where shared memory
is a natural fit and page-level granularity matches access patterns.
What DSM is NOT good for:
- Fine-grained shared data structures (concurrent hash maps, work-stealing
queues). These cause page-level false sharing and ownership bouncing.
Use explicit RDMA messaging or partitioned data structures instead.
- Workloads with random write patterns across the shared space.
Every cross-node write fault costs ~5-50μs (RDMA round-trip + TLB
invalidation), vs ~100ns for local memory. This is 50-500x slower.
- Latency-sensitive paths. DSM page faults are unpredictable.
Performance model (InfiniBand 200Gb/s, ~2μs RTT):
Local page access (TLB hit): ~1 ns
Local page fault (mmap, disk): ~1-100 μs
DSM read fault (page not present): ~10-18 μs (directory lookup ~4-5 μs
+ ownership negotiation ~3-5 μs
+ page transfer ~3-5 μs
+ local install ~1 μs;
see §47.5.4 for detailed flow)
DSM write fault (exclusive ownership): ~10-50 μs (invalidate readers + transfer)
DSM false sharing (bouncing page): ~100+ μs per iteration (pathological)
The mitigations in §47.5.8 help, but cannot eliminate the fundamental cost of network coherence. Applications must be DSM-aware at the data structure level. The kernel's job is to make the common case fast and the pathological case detectable (tracepoints), not to pretend the network is as fast as local memory.
47.5.11 Interaction with Memory Compression (Section 13)
The DSM protocol and the memory compression tier (Section 13) have a potential conflict: when a page is compressed in the local zpool, should it be advertised as "present" or "not present" to the DSM directory?
Design decision: Compressed pages are treated as locally present in the DSM protocol. The decompression cost (~1-3 microseconds via LZ4) is far lower than a network fetch (~5-50 microseconds over RDMA), so it never makes sense to fetch a remote copy of a page that exists locally in compressed form.
Interaction rules:
- Remote node requests a locally compressed page: The owning node decompresses from its zpool, then transfers the uncompressed data via RDMA Write. The zpool entry is freed after transfer.
- DSM migrates a page to a remote node: The sending node decompresses first, then sends the uncompressed 4KB page. The receiving node may independently compress it based on local memory pressure.
- Compressed page metadata is NOT replicated across nodes. Compression is a node-local optimization, invisible to the DSM directory and coherence protocol.
- DSM coherence protocol uses only the {CpuNode, RemoteNode, NotPresent, Migrating} variants of PageLocation (Section 43.1.5) for coherence decisions. A "Compressed" DSM state is NOT added — a compressed page is simply "local on CpuNode(N)" from the DSM protocol's perspective. The full PageLocationTracker tracks all variants (including DeviceLocal, Compressed, Swapped, RemoteDevice, CxlPool), but compression is transparent to the DSM directory.
Edge case — double fault path: If Node B requests a page from Node A, and Node A has that page compressed in its zpool: (1) Node A receives the RDMA request, (2) local MM subsystem decompresses from zpool (~1-2 microseconds), (3) DSM handler completes the RDMA Write with the decompressed data. The requesting node never knows the page was compressed. Total added latency: ~1-2 microseconds.
47.6 Global Memory Pool
47.6.1 Design: Cluster Memory as a Unified Tier Hierarchy
The memory manager already manages a tier hierarchy:
Current (single-node):
Tier 0: Per-CPU page caches (fastest, smallest)
Tier 1: Local NUMA node DRAM (fast, local)
Tier 2: Remote NUMA node DRAM (cross-socket, ~150ns)
Tier 3: GPU VRAM (Section 43)
Tier 4: Compressed pool (Section 13)
Tier 5: Swap (NVMe SSD)
Extended (distributed cluster):
Tier 0: Per-CPU page caches (unchanged)
Tier 1: Local NUMA node DRAM (unchanged)
Tier 2: Remote NUMA node DRAM, same machine (unchanged)
Tier 3: CXL-attached memory pool (~200-400ns) ← NEW
Tier 4: GPU VRAM (unchanged)
Tier 5: Compressed pool (unchanged)
Tier 6: Remote node DRAM via RDMA (~3-5 μs) ← NEW
Tier 7: Remote node compressed pool via RDMA ← NEW
Tier 8: Local NVMe swap (unchanged)
Tier 9: Remote NVMe via NVMe-oF/RDMA ← NEW
Key insight: Tier 6 (remote RDMA, ~3-5 μs) is faster than Tier 8 (local NVMe).
The global memory pool inserts remote memory as a tier BETWEEN
compressed pages and local swap. Compressed pool (Tier 5, ~1-2 μs
decompression) precedes remote RDMA because local decompression is
faster than a network round-trip.
Note: Tier numbers are ordinal positions in the latency-sorted hierarchy, not
fixed identifiers. When distributed mode adds new tier sources (CXL-attached memory,
remote node DRAM), existing sources shift to higher tier numbers. Code that uses
tier numbers MUST NOT hardcode numeric values — instead, use the TierKind enum
(defined in Section 12.7) to identify tier types by semantic name (e.g.,
TierKind::LocalDram, TierKind::GpuVram, TierKind::DsmRemote) and query
mem::tier_ordinal(kind) for the current ordinal position of each kind.
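The ordinal-vs-kind distinction can be sketched as a lookup over the current latency-sorted order. The TierKind variants and the table below are illustrative (Section 12.7 holds the canonical definitions); the point is that code names tiers semantically and asks for their current position.

```rust
// Illustrative TierKind enum and ordinal lookup; canonical definitions
// live in Section 12.7. Variant names beyond those cited in the text
// (LocalDram, GpuVram, DsmRemote) are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TierKind {
    PerCpuCache,
    LocalDram,
    RemoteNumaDram,
    CxlPool,
    GpuVram,
    CompressedPool,
    DsmRemote,
    DsmRemoteCompressed,
    LocalSwap,
    RemoteNvmeOf,
}

/// The distributed-mode latency-sorted order from the hierarchy above.
pub const DISTRIBUTED_ORDER: [TierKind; 10] = [
    TierKind::PerCpuCache,         // Tier 0
    TierKind::LocalDram,           // Tier 1
    TierKind::RemoteNumaDram,      // Tier 2
    TierKind::CxlPool,             // Tier 3 (new)
    TierKind::GpuVram,             // Tier 4
    TierKind::CompressedPool,      // Tier 5
    TierKind::DsmRemote,           // Tier 6 (new)
    TierKind::DsmRemoteCompressed, // Tier 7 (new)
    TierKind::LocalSwap,           // Tier 8
    TierKind::RemoteNvmeOf,        // Tier 9 (new)
];

/// Current ordinal position of a tier kind in the active order.
pub fn tier_ordinal(order: &[TierKind], kind: TierKind) -> Option<usize> {
    order.iter().position(|&k| k == kind)
}
```

Code written this way survives tier insertions: when distributed mode shifts LocalSwap from tier 5 to tier 8, callers comparing tier_ordinal values still get the right answer.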
47.6.2 Memory Pool Accounting
// isle-core/src/mem/global_pool.rs
/// Global memory pool state (cluster-wide view, maintained per-node).
pub struct GlobalMemoryPool {
/// Per-node memory availability.
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 47.2.4) to avoid heap allocation.
nodes: [NodeMemoryState; MAX_CLUSTER_NODES],
/// Number of active nodes in the cluster.
node_count: u32,
/// Total cluster memory (sum of all nodes).
total_cluster_bytes: u64,
/// Total available for remote allocation (sum of exported pools).
total_available_bytes: u64,
/// Current remote memory usage by local processes.
local_remote_usage_bytes: AtomicU64,
/// Current memory exported to remote nodes.
exported_usage_bytes: AtomicU64,
/// Policy for remote memory allocation.
policy: GlobalPoolPolicy,
}
pub struct NodeMemoryState {
node_id: NodeId,
/// Total physical memory on this node.
total_bytes: u64,
/// Memory available for remote allocation.
/// Admin-configurable: don't export all memory.
/// Default: 25% of total (local workloads get priority).
/// This pool is backed by the RDMA-registered memory region (Section 47.3.3).
/// Only RDMA-registered pages can be served to remote nodes.
export_pool_bytes: u64,
/// Currently allocated to remote nodes.
export_used_bytes: u64,
/// Distance to this node (from cluster distance matrix).
distance_ns: u32,
/// Bandwidth to this node (bytes/sec).
bandwidth_bytes_per_sec: u64,
}
pub struct GlobalPoolPolicy {
/// Maximum percentage of local memory to export for remote use.
/// Default: 25%. Protects local workloads from starvation.
pub max_export_percent: u32,
/// When local memory pressure exceeds this threshold,
/// start reclaiming exported pages (evicting remote users).
/// Default: 80% of local memory usage.
pub reclaim_threshold_percent: u32,
/// Prefer remote memory over local swap?
/// Default: true (RDMA is faster than NVMe).
pub prefer_remote_over_swap: bool,
/// Maximum remote memory a single process can consume (bytes).
/// 0 = unlimited (subject to cgroup limits).
pub per_process_remote_max: u64,
}
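The export-side admission decision implied by these fields can be sketched as a pure function: a grant must fit inside the configured export pool, and no new exports are granted once local pressure crosses the reclaim threshold. The helper and its exact checks are illustrative; the field names follow the structs above.

```rust
// Illustrative export admission check combining NodeMemoryState and
// GlobalPoolPolicy semantics. The helper itself is hypothetical.
pub struct ExportState {
    pub total_bytes: u64,
    pub export_pool_bytes: u64,
    pub export_used_bytes: u64,
    pub local_used_bytes: u64,
}

/// May this node grant `req_bytes` of its memory to a remote allocator?
pub fn can_export(
    s: &ExportState,
    req_bytes: u64,
    max_export_percent: u64,      // default 25
    reclaim_threshold_percent: u64, // default 80
) -> bool {
    // Never exceed the configured export pool (capped at max_export_percent
    // of total memory, protecting local workloads from starvation).
    let export_cap = (s.total_bytes * max_export_percent) / 100;
    let pool_limit = s.export_pool_bytes.min(export_cap);
    if s.export_used_bytes + req_bytes > pool_limit {
        return false;
    }
    // Under local pressure, stop granting new exports; the reclaim path
    // additionally starts evicting existing remote users.
    let local_pressure = (s.local_used_bytes * 100) / s.total_bytes;
    if local_pressure >= reclaim_threshold_percent {
        return false;
    }
    true
}
```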
Security requirement — Cross-node capability chain validation: Remote memory access via the global memory pool MUST validate capability chains across nodes. A capability issued by Node A that authorizes access to memory on Node A must NOT be usable to access memory on Node B without explicit cross-node authorization. Validation requirements:
1. Capability scope validation: When Node A's process accesses memory exported by Node B via the global pool, the kernel MUST verify that:
   - The process holds a valid DistributedCapability (Section 47.9.2) for remote memory access, signed by a trusted issuer
   - The capability's constraints field explicitly authorizes the target NodeId (or contains NODE_ID_ANY for cluster-wide access)
   - The capability has not expired, been revoked, or had its generation invalidated (standard distributed capability validation)
   - The capability's permissions field includes REMOTE_MEMORY_READ and/or REMOTE_MEMORY_WRITE as appropriate for the operation
2. Delegation chain integrity: If a capability was derived through delegation (e.g., process P1 on Node A delegates to process P2 on Node B), the kernel MUST verify the entire chain:
   - Each capability in the chain MUST have a valid signature from its issuer
   - Each delegation MUST have DELEGATE permission in the parent capability
   - Derived capabilities MUST be strictly less powerful than their parent (no permission amplification)
   - The delegation depth MUST NOT exceed MAX_CAP_DELEGATION_DEPTH (default: 8) to prevent resource exhaustion
3. Remote node attestation: Before honoring a cross-node capability, Node B MUST verify that Node A is a current, authenticated cluster member (not revoked, evicted, or marked Dead per Section 47.11). This check is O(1) via the cluster membership bitmap.
4. Audit logging: Cross-node capability validations that fail MUST be logged to the security audit subsystem (Section 17) with:
   - Source node ID
   - Target node ID
   - Capability object_id and generation
   - Failure reason (signature invalid, expired, revoked, wrong node, etc.)
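The delegation-chain rules can be sketched as a walk over a root-first chain. Signature verification is stubbed as a per-link boolean, and "strictly less powerful" is modeled as "no permission bit absent from the parent"; the types and the PERM_DELEGATE bit value are illustrative.

```rust
// Illustrative delegation-chain validation. Hypothetical types; the real
// DistributedCapability (Section 47.9.2) carries signatures, generations,
// and node constraints beyond this sketch.
pub const MAX_CAP_DELEGATION_DEPTH: usize = 8;
pub const PERM_DELEGATE: u32 = 0x8000; // bit value is an assumption

pub struct CapLink {
    pub permissions: u32,
    pub signature_valid: bool, // stands in for issuer-signature verification
}

/// Validate a chain ordered root-first.
pub fn validate_chain(chain: &[CapLink]) -> Result<(), &'static str> {
    if chain.is_empty() {
        return Err("empty chain");
    }
    if chain.len() > MAX_CAP_DELEGATION_DEPTH {
        return Err("delegation depth exceeded");
    }
    // Every link must carry a valid signature from its issuer.
    for link in chain {
        if !link.signature_valid {
            return Err("invalid signature");
        }
    }
    // Each parent/child pair: parent must hold DELEGATE, and the child
    // may not hold any permission bit the parent lacks (no amplification).
    for pair in chain.windows(2) {
        let (parent, child) = (&pair[0], &pair[1]);
        if parent.permissions & PERM_DELEGATE == 0 {
            return Err("parent lacks DELEGATE");
        }
        if child.permissions & !parent.permissions != 0 {
            return Err("permission amplification");
        }
    }
    Ok(())
}
```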
47.6.3 The Killer Use Case: AI Model Memory
Cluster: 4 nodes, each 512GB RAM + 4× A100 GPUs (80GB VRAM each)
Total CPU RAM: 2 TB
Total GPU VRAM: 1.28 TB
Total cluster memory: 3.28 TB
Scenario: Run a 405B parameter model (Llama-3.1-405B, ~810GB at FP16)
Without global memory pool (current state of the art):
- Tensor parallelism across 16 GPUs: each GPU holds 1/16 of model (~50GB)
- Model fits in GPU VRAM (80GB per GPU)
- BUT: KV cache for long context (128K tokens) = ~100-200GB additional
- KV cache spills to CPU RAM via UVM → 10-15 μs per page fault to NVMe
- Inference latency: dominated by KV cache spills
With global memory pool:
- GPU VRAM: hot layers (active attention heads, current KV cache entries)
- Local CPU RAM: warm layers (recent KV cache, inactive attention heads)
- Remote CPU RAM (RDMA): cold KV cache entries from other nodes
→ 3-5 μs per page fault (faster than local NVMe!)
- Local NVMe: only for truly cold data (old checkpoints, etc.)
The kernel manages placement transparently:
- MigrationPolicy tracks access patterns per page
- Hot pages migrate toward the accessing GPU/CPU
- Cold pages migrate to remote nodes with more available memory
- The ML framework sees a flat address space; kernel handles the rest
Performance impact:
- KV cache "miss" on remote RDMA: ~5 μs (vs. ~15 μs from NVMe)
- 3x improvement in tail latency for long-context inference
- No application code changes required
47.6.4 Cgroup Integration
/sys/fs/cgroup/<group>/memory.remote.max
# Maximum remote memory this cgroup can consume (bytes)
# Default: "max" (unlimited, subject to global pool policy)
/sys/fs/cgroup/<group>/memory.remote.current
# Current remote memory usage (read-only)
/sys/fs/cgroup/<group>/memory.remote.stat
# Remote memory statistics:
# remote_alloc <bytes allocated on remote nodes>
# remote_faults <page faults resolved from remote>
# remote_migrations_in <pages migrated from remote to local>
# remote_migrations_out <pages migrated from local to remote>
# rdma_bytes_rx <total RDMA data received>
# rdma_bytes_tx <total RDMA data sent>
/sys/fs/cgroup/<group>/memory.tier_preference
# Override default tier ordering for this cgroup:
# "local,cxl,remote,compressed,swap" (default)
# "local,compressed,swap" (disable remote memory for this cgroup)
# "local,cxl,remote,swap" (skip compression, prefer remote)
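A write to memory.tier_preference must validate the comma-separated tier list before applying it. A minimal sketch of that parse, using the tier names from the examples above (the error-handling shape is an assumption of this sketch, mirroring a cgroup write failing with EINVAL):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum MemTier { Local, Cxl, Remote, Compressed, Swap }

/// Parse a memory.tier_preference string such as
/// "local,cxl,remote,compressed,swap".
/// Unknown or duplicate tiers are rejected.
pub fn parse_tier_preference(s: &str) -> Result<Vec<MemTier>, String> {
    let mut out = Vec::new();
    for tok in s.split(',') {
        let tier = match tok.trim() {
            "local" => MemTier::Local,
            "cxl" => MemTier::Cxl,
            "remote" => MemTier::Remote,
            "compressed" => MemTier::Compressed,
            "swap" => MemTier::Swap,
            other => return Err(format!("unknown tier: {}", other)),
        };
        if out.contains(&tier) {
            return Err(format!("duplicate tier: {}", tok.trim()));
        }
        out.push(tier);
    }
    Ok(out)
}
```

Omitting a tier (e.g. "local,compressed,swap") disables it for the cgroup, as in the second example above.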
47.7 Distributed Page Cache
47.7.1 Problem
When Node A reads a file that Node B recently accessed and has cached, the standard approach (NFS/CIFS) refetches from the storage server over TCP. This ignores that Node B already has the data cached in its page cache and could serve it faster via RDMA.
47.7.2 Design: Cooperative Page Cache
The page cache gains awareness of what other nodes have cached:
Traditional NFS read:
Node A: read() → VFS → NFS client → TCP → NFS server → disk → TCP → Node A
Latency: ~200 μs (network + server + disk if not cached on server)
Cooperative page cache (shared filesystem):
Node A: read() → VFS → page cache miss → WHERE is this page?
Option 1: Remote page cache (RDMA read from Node B's page cache)
Latency: ~3-5 μs
Option 2: Local disk
Latency: ~10-15 μs (NVMe)
Option 3: Remote disk (NVMe-oF/RDMA)
Latency: ~15-25 μs
Option 4: Traditional NFS/CIFS
Latency: ~200 μs
Kernel picks the fastest source automatically.
47.7.3 Page Cache Directory
// isle-core/src/vfs/cooperative_cache.rs
/// Distributed page cache directory.
/// Tracks which nodes have which file pages cached.
/// Uses a Bloom filter per file for space efficiency.
pub struct CooperativeCache {
/// Per-file cache presence hints.
/// Key: (filesystem_id, inode_number)
/// Value: per-node Bloom filter of cached page offsets.
/// Uses a SlabHashMap (open-addressing hash map backed by the slab allocator)
/// rather than BTreeMap: O(1) average lookup vs. O(log n) with poor cache
/// locality. The number of tracked files is unbounded and dynamic, so a
/// fixed-size array is not applicable; a hash map is the correct structure.
/// The per-file hint array is fixed at MAX_CLUSTER_NODES (64): one optional
/// entry per node. This avoids heap allocation per file and provides O(1)
/// lookup by NodeId (direct indexing). Entries for absent nodes are `None`.
/// This is not on the page fault hot path — it is consulted on page cache miss,
/// which already involves I/O — but O(1) lookup avoids unnecessary overhead.
hints: SlabHashMap<FileId, [Option<NodeCacheHint>; MAX_CLUSTER_NODES]>,
}
pub struct NodeCacheHint {
node_id: NodeId,
/// Bloom filter of page offsets this node has cached.
/// False positives are OK (just a wasted RDMA read, fallback to disk).
/// False negatives are not OK (would miss a cache hit).
/// 4KB (32,768-bit) Bloom filter per file per node: ~0.1% FP rate up to
/// ~2,300 cached pages (~14 bits/page, k=10); at 10K pages FP rises to ~20%.
bloom: BloomFilter,
/// Last update timestamp (hints expire after 30 seconds).
updated_ns: u64,
}
impl CooperativeCache {
/// Find the best source for a cache page.
pub fn find_page(
&self,
file: FileId,
page_offset: u64,
local_node: NodeId,
distances: &ClusterDistanceMatrix,
) -> PageSource {
// 1. Check local page cache first (always).
// 2. Check Bloom filters: which remote nodes might have this page?
// 3. Of those, pick the closest (lowest distance in matrix).
// 4. If no remote cache hit: fall back to storage.
}
}
pub enum PageSource {
/// Page is in local page cache. No I/O needed.
LocalCache,
/// Page is likely cached on a remote node. Try RDMA read.
RemoteCache { node_id: NodeId, expected_latency_ns: u32 },
/// Page is not cached anywhere. Read from storage.
Storage { device: DeviceNodeId },
/// Page is on remote storage (NVMe-oF/RDMA).
RemoteStorage { node_id: NodeId, device: DeviceNodeId },
}
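The four steps sketched in `find_page`'s comments can be made concrete. This is a self-contained simplification: a slice of Bloom-positive nodes and a closure stand in for the per-node filters and the `ClusterDistanceMatrix`:

```rust
pub type NodeId = u32;

/// Simplified PageSource for this sketch (storage variants collapsed).
#[derive(Debug, PartialEq)]
pub enum PageSource {
    LocalCache,
    RemoteCache { node_id: NodeId },
    Storage,
}

/// Steps 1-4 of find_page: local page cache first, then the closest
/// Bloom-positive remote node, then fall back to storage.
pub fn pick_source(
    in_local_cache: bool,
    bloom_hits: &[NodeId],            // remote nodes whose filter reports the page
    distance: &dyn Fn(NodeId) -> u32, // lower is closer
) -> PageSource {
    if in_local_cache {
        return PageSource::LocalCache;
    }
    bloom_hits
        .iter()
        .copied()
        .min_by_key(|&n| distance(n))
        .map(|n| PageSource::RemoteCache { node_id: n })
        .unwrap_or(PageSource::Storage)
}
```

A Bloom false positive simply makes the RDMA read miss, after which the caller retries against the next-best source, ending at storage.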
47.7.4 Cache Coherence for Shared Files
When multiple nodes cache the same file, writes must be coordinated:
Write strategy (per-file, configurable):
1. Write-invalidate (default for shared mutable files):
- Writer acquires exclusive ownership (like DSM write fault)
- All reader copies are invalidated via RDMA
- Writer modifies page, becomes sole cached copy
- Other nodes re-fault on next access (get updated page)
2. Write-through (for append-only logs, databases):
- Writer writes to local page cache AND pushes update to owner
- Owner propagates to all readers via RDMA write
- Higher bandwidth cost, but readers see updates faster
3. No coherence (for read-only data, e.g., shared model weights):
- File is marked immutable (or read-only mounted)
- All nodes cache freely, no invalidation needed
- Best case for ML inference: model weights cached everywhere
Integration with the Distributed Lock Manager (Section 31a):
For clustered filesystems (Section 31), page cache coherence is coordinated through DLM lock operations rather than the DSM coherence protocol. The DLM provides filesystem-aware semantics that the generic DSM protocol cannot:
- Lock downgrade triggers targeted writeback: When a DLM lock is downgraded from EX to PR (Section 31a.8), only dirty pages tracked by the lock's `LockDirtyTracker` are flushed — not the entire inode's page cache. This eliminates the Linux problem where dropping a lock on a large file requires flushing all dirty pages regardless of how many were actually modified.
- MOESI-like page states: Each cached page on a clustered filesystem carries a coherence state relative to the DLM lock protecting it:
  - Modified: Page dirty under EX lock. Sole copy in the cluster.
  - Owned: Page was modified, then lock downgraded to PR. This node is responsible for writing back on eviction. Other nodes may have Shared copies.
  - Exclusive: Page clean, held under EX lock. Can transition to Modified without network traffic.
  - Shared: Page clean, held under PR/CR lock. Read-only.
  - Invalid: Lock released or revoked. Page must be re-fetched on next access.
- Per-lock-range dirty tracking: The cooperative page cache directory (Section 47.7.3) integrates with Section 31a.8's `LockDirtyTracker` to record which pages were dirtied under which lock range. On lock downgrade, the writeback is scoped to the lock's byte range — concurrent holders of non-overlapping ranges on the same file are not affected.
47.7.5 AI Training Data Pipeline
Training data scenario:
- 100TB dataset on shared NVMe storage
- 8 training nodes, each with 4 GPUs
- Each node reads different shards, but shards overlap (data augmentation)
Without cooperative cache:
Each node reads its shard from storage independently.
If Node A and Node B need the same page: two storage reads.
With cooperative cache:
Node A reads page from storage → cached in Node A's page cache.
Node B needs same page → Bloom filter shows Node A has it.
Node B fetches via RDMA Read from Node A: ~3 μs (vs ~15 μs from NVMe).
Storage bandwidth saved: proportional to shard overlap.
For a typical ImageNet-style dataset with 30% shard overlap:
~30% reduction in storage I/O, ~30% more effective storage bandwidth.
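The per-file hint filters used here follow the standard Bloom false-positive estimate p = (1 - e^(-kn/m))^k for m bits, n inserted pages, and k hash functions. For the 4 KB (32,768-bit) filters, roughly 2,300 cached pages with k = 10 keep false positives near 0.1%, while 10K pages push the rate past 20%; a false positive costs only one wasted RDMA read, never correctness. A quick check:

```rust
/// Standard Bloom-filter false-positive estimate:
/// p = (1 - e^(-k*n/m))^k  for m bits, n inserted items, k hash functions.
pub fn bloom_fp_rate(m_bits: f64, n_items: f64, k_hashes: f64) -> f64 {
    (1.0 - (-k_hashes * n_items / m_bits).exp()).powf(k_hashes)
}
```

Because false negatives are forbidden but false positives are merely wasteful, a degraded FP rate on very large cached files is acceptable; it only dilutes the bandwidth savings.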
47.8 Cluster-Aware Scheduler
47.8.1 Problem
The scheduler (Section 14) currently optimizes process placement within a single machine: NUMA-aware load balancing, work stealing between CPUs, migration cost modeling. For a distributed kernel, the scheduler should consider the entire cluster.
47.8.2 Design: Two-Level Scheduler
Level 1: Global Cluster Scheduler (runs every ~10s, lightweight)
- Monitors per-node load (CPU, memory, accelerator utilization)
- Decides process-to-node placement
- Triggers process migration when data locality warrants it
- Respects node affinity, cgroup constraints, capability requirements
Level 2: Per-Node Scheduler (existing, runs every ~4ms)
- CFS/EEVDF + RT + DL queues (unchanged)
- NUMA-aware CPU placement (unchanged)
- Accelerator scheduling (Section 42.2.4, unchanged)
47.8.3 Global Scheduler State
// isle-core/src/sched/cluster.rs
pub struct ClusterScheduler {
/// Per-node load summary (updated via periodic RDMA exchange).
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 47.2.4) to avoid heap allocation.
node_loads: [NodeLoad; MAX_CLUSTER_NODES],
/// Process-to-data affinity map.
/// Tracks which node holds most of a process's working set.
/// Uses a slab allocator keyed by ProcessId for O(1) average lookup;
/// BTreeMap's O(log n) tree traversal with poor cache locality is
/// unsuitable for per-scheduling-tick access under load.
data_affinity: SlabMap<ProcessId, DataAffinity>,
/// Cluster distance matrix (Section 47.2.4).
distances: ClusterDistanceMatrix,
/// Global load balance interval.
balance_interval_ms: u32, // Default: 10000ms (10 seconds)
/// Migration threshold: only migrate if improvement exceeds this.
migration_threshold_ppt: u32, // Default: 300 (300/1000 = 30% locality improvement)
}
pub struct NodeLoad {
node_id: NodeId,
/// CPU utilization 0-100% (average across all CPUs).
cpu_percent: u32,
/// Memory pressure 0-100% (0 = plenty free, 100 = thrashing).
memory_pressure: u32,
/// Accelerator utilization 0-100% (average across all accelerators).
accel_percent: u32,
/// Number of runnable processes.
runnable_count: u32,
/// Remote memory faults per second (high = poor locality).
remote_fault_rate: u32,
}
pub struct DataAffinity {
/// How many pages of this process's working set are on each node.
/// Used to decide where a process should run.
// Fixed array indexed by NodeId; O(1) access, no allocation. MAX_CLUSTER_NODES=64.
pages_per_node: [u64; MAX_CLUSTER_NODES],
/// Total working set size (pages).
total_working_set: u64,
/// Node with most pages (preferred placement).
preferred_node: NodeId,
/// Working set locality on preferred node (0-1000, parts per thousand).
locality_score_ppt: u32,
}
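Recomputing `preferred_node` and `locality_score_ppt` from `pages_per_node` is a single pass over the fixed array. A sketch, assuming `MAX_CLUSTER_NODES = 64` as above (tie-breaking to the lowest node id is a choice of this sketch):

```rust
pub const MAX_CLUSTER_NODES: usize = 64;

/// Derive the preferred placement and its parts-per-thousand locality
/// score from the per-node working-set page counts.
/// Ties resolve to the lowest node id.
pub fn preferred_placement(pages_per_node: &[u64; MAX_CLUSTER_NODES]) -> (usize, u32) {
    let total: u64 = pages_per_node.iter().sum();
    let mut best_node = 0usize;
    let mut best_pages = pages_per_node[0];
    for (node, &pages) in pages_per_node.iter().enumerate().skip(1) {
        if pages > best_pages {
            best_node = node;
            best_pages = pages;
        }
    }
    let ppt = if total == 0 { 0 } else { (best_pages * 1000 / total) as u32 };
    (best_node, ppt)
}
```

A score of 700 ppt on the preferred node corresponds to the ">70% of pages" migration trigger in Section 47.8.6.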
47.8.4 Process Migration
When the cluster scheduler decides a process should move to another node:
Process migration from Node A to Node B:
1. Cluster scheduler on Node A decides: process P should run on Node B.
Reason: 70% of P's working set is on Node B (remote page faults dominate).
2. Pre-migration:
a. Send process metadata to Node B: PID, capabilities, cgroup, open files,
signal handlers, register state.
b. Node B allocates process slot, creates local task struct.
3. Freeze and transfer:
a. Freeze process P on Node A (stop scheduling, save register state).
b. Transfer register state to Node B via RDMA (~64 bytes, ~1 μs).
c. Transfer kernel stack to Node B via RDMA (~16KB, ~2 μs).
d. Mark P's page table on Node A as "migrated to Node B."
Pages are NOT bulk-transferred — they fault in on demand.
4. Resume on Node B:
a. Node B installs process in local scheduler.
b. Process resumes execution on Node B.
c. First memory access → page fault → fetch from Node A via RDMA (~3-5 μs).
d. Subsequent accesses: pages migrate on demand.
e. Working set migrates over ~100ms as pages are faulted in.
Total migration downtime: ~10-50 μs (metadata transfer + freeze/thaw)
Working set follows lazily: pages migrate over seconds as accessed.
This is the same strategy as live VM migration (pre-copy / post-copy),
but at process granularity. Much lighter weight than VM migration.
Full process migration requires transferring the following state:
- Register state: CPU registers, FPU/SIMD state (saved during freeze).
- Kernel stack: The process's kernel-mode stack (~16KB).
- Page table metadata: Transferred lazily — pages fault in on demand from the source node via RDMA.
- Open file descriptors: For local files on the source node, a proxy is created that forwards read/write/ioctl over RDMA IPC to the source. For shared-filesystem files (NFS, CIFS), the descriptor is re-opened locally on the destination.
- IPC handles: SharedMemory transport handles are converted to Rdma transport handles (Section 47.4.2). The ring buffer contents are preserved.
- Cgroup membership: Recreated on the destination node. The destination cgroup must have sufficient quota.
- Signal state: Pending signals, signal mask, and signal handlers are transferred.
- Timer state: Active timers (POSIX, itimer) are recreated on the destination with remaining duration adjusted for clock synchronization (Section 47.11.5).
- Device handles: Local accelerator contexts are converted to `RemoteDeviceProxy` handles (Section 47.10.2). The process continues to use the original device via RDMA-proxied commands.
Not all state is migratable in v1. See Open Questions (Section 47.14.6) for deferred items including io_uring migration, TCP connection migration, namespace handling, and ptrace state.
Process migration scope (v1):
- In scope: CPU register state, page tables (lazy), local file proxying, IPC handle
conversion, cgroup membership, signal state, timer state, accelerator handle proxying.
- Out of scope for v1: Active TCP/UDP connections (must be re-established), active
RDMA queue pairs (application must handle), hardware-specific state (GPU VRAM contents,
NIC offload state), processes with CLONE_VM threads (thread group migration deferred).
- Limitation: A process holding hardware resources that cannot be proxied (e.g.,
direct GPU rendering context with bound VRAM) will fail migration with -EOPNOTSUPP.
Such processes should set CLUSTER_PIN_NODE to prevent migration attempts.
47.8.5 Capability-Gated Migration
Process migration requires capabilities:
pub const CLUSTER_MIGRATE: u32 = 0x0200; // Allow process migration to remote nodes
pub const CLUSTER_PIN_NODE: u32 = 0x0201; // Pin process to specific node (prevent migration)
pub const CLUSTER_ADMIN: u32 = 0x0202; // Cluster-wide scheduler administration
Processes without CLUSTER_MIGRATE are never migrated. Processes with CLUSTER_PIN_NODE
can pin themselves to their current node. Containers/cgroups can restrict which nodes
their processes can run on:
/sys/fs/cgroup/<group>/cluster.nodes
# Allowed nodes for this cgroup: "0 1 2" or "all"
# Default: current node only (no migration)
/sys/fs/cgroup/<group>/cluster.migrate
# "auto" (kernel decides), "never" (pinned), "prefer" (hint)
# Default: "never" (existing Linux behavior)
47.8.6 Reconciliation: Local vs Distributed Scheduling
The single-node scheduler (Section 14) optimizes for cache locality — keeping tasks on the same CPU core. The distributed scheduler may migrate tasks across nodes, destroying all cache state.
Design principle: Cross-node migration is a last resort, not a default action.
Two-level hierarchy with strict separation:
- Intra-node (Section 14): CFS/EEVDF handles all CPU-local decisions (~4ms tick). Entirely unaware of the cluster.
- Inter-node: `ClusterScheduler` runs every 10 seconds — deliberately slow because cross-node migration is 1000x more expensive than cross-CPU migration.
Migration threshold: A task migrates cross-node only when at least one of these holds:
- Source node CPU utilization exceeds 120% of cluster average AND target is below 80% (sustained for 2+ rebalance intervals), OR
- The task's working set is predominantly on the target node (>70% of pages, per `DataAffinity`), OR
- The task's affinity mask explicitly requests a different node.
Migration cost model:
migration_benefit = (source_load - target_load) * task_weight
migration_cost = cache_refill_time + network_transfer_time + tlb_flush_time
Migration proceeds only when migration_benefit > migration_cost * 1.5 (50% hysteresis
to prevent oscillation).
Warm-up penalty: After cross-node migration, the task's effective load weight is inflated 2x for 20 seconds, preventing immediate re-migration before cache state builds.
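The threshold and cost model above reduce to a small predicate. Units are abstract cost units; expressing the 1.5x hysteresis as x3/2 in integer math is a choice of this sketch (kernel code avoids floating point):

```rust
/// Hysteresis: benefit must exceed cost by 50% (i.e. benefit > cost * 3/2).
const HYSTERESIS_NUM: u64 = 3;
const HYSTERESIS_DEN: u64 = 2;

/// Cross-node migration decision per the cost model above.
/// All inputs share one abstract cost unit.
pub fn should_migrate(
    source_load: u64,
    target_load: u64,
    task_weight: u64,
    cache_refill: u64,
    net_transfer: u64,
    tlb_flush: u64,
) -> bool {
    if source_load <= target_load {
        return false; // no benefit in moving toward an equally or more loaded node
    }
    let benefit = (source_load - target_load) * task_weight;
    let cost = cache_refill + net_transfer + tlb_flush;
    // benefit > cost * 1.5, without floating point
    benefit * HYSTERESIS_DEN > cost * HYSTERESIS_NUM
}
```

Together with the 2x warm-up inflation, the hysteresis is what prevents a task from ping-ponging between two nodes whose loads oscillate around each other.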
47.9 Network-Portable Capabilities
47.9.1 Problem
ISLE's capability system (Section 11) uses opaque kernel-memory tokens validated locally. For distributed operation, capabilities must work across nodes: a process migrated from Node A to Node B should retain its capabilities. A remote RDMA operation should be authorized by a capability that the remote node can verify.
47.9.2 Design: Cryptographically-Signed Capabilities
// isle-core/src/cap/distributed.rs
/// Network-portable capability — split into a compact header (fits on the
/// kernel stack, ~64 bytes) and a separately-allocated signature payload.
///
/// Rationale: PQC signatures (ML-DSA-65: 3,309 bytes, hybrid: 3,373 bytes)
/// make the full capability ~3.6 KB, which must NOT be placed on the kernel
/// stack (kernel stacks are 8-16 KB). The split design keeps the hot path
/// (permission checks, expiry checks) on the stack via CapabilityHeader,
/// while the signature data lives in a slab-allocated CapabilitySignature.
#[repr(C)]
pub struct CapabilityHeader {
/// The local capability this was derived from.
pub object_id: ObjectId,
pub permissions: PermissionBits,
pub generation: u64,
pub constraints: CapConstraints,
// === Network portability extensions ===
/// Node that issued this capability.
pub issuer_node: NodeId,
/// Timestamp of issuance (cluster-relative wall clock,
/// synchronized via PTP/NTP, see Section 47.11.5).
pub issued_at_ns: u64,
/// Expiry timestamp (cluster-relative wall clock, see Section 47.11.5).
/// Capabilities MUST have bounded lifetime
/// (prevents stale capabilities after node failure).
/// Default: 5 minutes. Renewable while issuer is alive.
/// Expiry checking includes a 1ms grace period for clock skew.
pub expires_at_ns: u64,
/// Signature algorithm identifier.
/// Uses SignatureAlgorithm encoding (Section 25.2). u16 accommodates
/// hybrid algorithm IDs (0x0200+).
pub sig_algorithm: u16,
/// Pointer to the slab-allocated signature data.
/// The signature is allocated from a dedicated `cap_sig_slab` pool
/// (fixed-size 3,588-byte slots) to avoid general heap allocation
/// on the capability verification path.
pub signature: *const CapabilitySignature,
}
/// Signature payload for a distributed capability.
/// Allocated from a dedicated slab allocator (`cap_sig_slab`), NOT from
/// the general heap, to ensure bounded allocation latency on the
/// capability verification path.
///
/// Signature formats supported for distributed capabilities:
/// - Ed25519: 64 bytes (current default)
/// - ML-DSA-65: 3,309 bytes (PQC migration target, Section 25)
/// - Hybrid Ed25519 + ML-DSA-65: 3,373 bytes (transition mode)
///
/// SLH-DSA-128f is deliberately EXCLUDED from distributed capabilities.
/// Its 17,088-byte signatures would add ~17 KB to every cross-node
/// capability transfer, making it impractical for runtime operations
/// that occur at lock-acquisition frequency. SLH-DSA-128f is supported
/// for boot signatures (Section 22, KernelSignature/DriverSignature
/// structs with 17,408-byte buffers) where it is verified once at load
/// time. For distributed capabilities, ML-DSA-65 provides NIST PQC
/// security at 1/5th the signature size.
///
/// MAX_DISTRIBUTED_SIG_BYTES = 3,584 (3.5 KiB, accommodates hybrid
/// signatures with alignment headroom).
#[repr(C)]
pub struct CapabilitySignature {
/// Actual signature length in bytes.
pub sig_len: u16,
pub _pad: [u8; 2],
/// Signature data. Only sig_len bytes are meaningful.
pub data: [u8; 3584],
}
/// Convenience type combining header + signature for full capability operations.
/// Never placed on the stack as a whole — the header is on the stack and
/// the signature is accessed via pointer.
pub type DistributedCapability = (CapabilityHeader, *const CapabilitySignature);
Memory layout note: The `CapabilityHeader` is ~64 bytes and safe to place on the kernel stack. The `CapabilitySignature` is 3,588 bytes and allocated from a dedicated slab pool (cap_sig_slab, 3,588-byte slots) — never on the stack. For RDMA transmission, the capability is serialized using a compact wire format that includes only `sig_len` bytes of signature data, reducing typical message size to ~200-400 bytes (Ed25519: 64-byte signature + ~100 bytes of header fields). The `sig_len` field determines how many bytes of the signature are included in the wire format.
Slab Allocator and DoS Mitigation: The cap_sig_slab pool uses per-process quotas
to prevent a malicious process from exhausting kernel memory by allocating many signatures:
/// Per-process quota for capability signature allocations.
/// Tracked in the process's capability space (Section 11).
pub struct CapSignatureQuota {
/// Maximum signature slots this process may hold simultaneously.
/// Default: 1024 (approximately 3.6 MB worst case).
/// Configurable via prctl(PR_SET_CAP_QUOTA).
pub max_slots: u32,
/// Currently allocated slots for this process.
/// Incremented on successful slab allocation, decremented on free.
pub used_slots: AtomicU32,
}
/// Default quota: 1024 signatures (~3.6 MB per process).
pub const DEFAULT_CAP_SIGNATURE_QUOTA: u32 = 1024;
/// System-wide limit on total signature slab memory.
/// Default: 1 GB (approximately 290,000 signatures).
pub const CAP_SIG_SLAB_TOTAL_LIMIT: usize = 1024 * 1024 * 1024;
Allocation protocol:
1. When a process derives a `DistributedCapability`, the kernel attempts to allocate a signature slot from `cap_sig_slab`.
2. Before allocation, check `used_slots.load(Acquire) < max_slots`. If exceeded, fail with `EAGAIN` (transient) — the process must free existing capabilities first.
3. After successful allocation, increment `used_slots` with `Release` ordering.
4. On capability drop (explicit revoke, expiry, or process exit), decrement `used_slots` and return the slot to the slab.
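As written, the check-then-increment sequence in the allocation protocol is racy: two threads of the same process can both pass the quota check before either increments. A compare-and-swap loop folds the check and the increment into one atomic step. A userspace sketch, with std atomics standing in for the kernel's:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub struct CapSignatureQuota {
    pub max_slots: u32,
    pub used_slots: AtomicU32,
}

impl CapSignatureQuota {
    /// Atomically reserve one signature slot, or fail (the EAGAIN case)
    /// if the quota is exhausted. fetch_update retries the CAS on
    /// contention, so the check and the increment cannot be interleaved
    /// by another thread.
    pub fn try_reserve(&self) -> Result<(), ()> {
        self.used_slots
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |used| {
                if used < self.max_slots { Some(used + 1) } else { None }
            })
            .map(|_| ())
            .map_err(|_| ())
    }

    /// Return a slot on capability drop (revoke, expiry, or process exit).
    pub fn release(&self) {
        self.used_slots.fetch_sub(1, Ordering::Release);
    }
}
```

The same pattern extends to the system-wide CAP_SIG_SLAB_TOTAL_LIMIT check, which would otherwise have the identical race across processes.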
Eager reclamation: When a process terminates (normal exit or killed), all its
CapabilitySignature slots are freed immediately. The capability space tracks
all signatures via an intrusive linked list per process — O(1) enumeration for cleanup.
Memory pressure handling: If the system-wide cap_sig_slab exceeds 80% of
CAP_SIG_SLAB_TOTAL_LIMIT, the kernel triggers a reclamation pass:
- Scan all processes for expired capabilities (where expires_at_ns < now).
- Free signatures for expired capabilities regardless of process quota.
- This ensures that expired capabilities don't accumulate under memory pressure.
Rationale for 1024-slot default: A typical process holds <100 distributed capabilities (file handles, memory regions, accelerator contexts). 1024 provides 10x headroom while limiting worst-case memory consumption to ~3.6 MB per process. For specialized workloads (distributed databases, HPC orchestrators), the admin can raise the limit via prctl.
47.9.3 Verification
Remote capability verification (any node):
1. Process on Node B presents a CapabilityHeader + CapabilitySignature to access
a resource. The header is on the stack; the signature is dereferenced from the slab.
2. Node B checks:
a. Signature valid? (verify with issuer_node's public key using the
algorithm specified in sig_algorithm — Ed25519 ~25 μs typical, ML-DSA-65 ~110 μs)
→ done once, then cached for lifetime of capability (keyed by object_id + generation)
b. Not expired? (compare expires_at_ns with current time)
→ ~10 ns (clock comparison)
c. Generation still valid? (check local revocation list)
→ ~100 ns (hash table lookup)
d. Permissions sufficient for requested operation?
→ ~10 ns (bitfield comparison)
3. If all checks pass: operation is authorized.
Total first-time verification: ~25 μs (Ed25519) or ~110 μs (ML-DSA-65).
Subsequent verifications (signature cached): ~200 ns.
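The cached fast path (steps 2b and 2d) is two comparisons. A sketch with simplified field types, using the 1 ms clock-skew grace period from Section 47.9.2 (the revocation check of step 2c is omitted here):

```rust
pub type PermissionBits = u32;

/// The stack-resident fields needed for the cached-signature fast path.
pub struct CapCheckFields {
    pub permissions: PermissionBits,
    pub expires_at_ns: u64,
}

/// Clock-skew grace period (Section 47.9.2): 1 ms.
pub const EXPIRY_GRACE_NS: u64 = 1_000_000;

/// Steps 2b and 2d: the ~tens-of-ns checks that run once the signature
/// has been verified and cached.
pub fn check_cached(cap: &CapCheckFields, now_ns: u64, required: PermissionBits) -> bool {
    let not_expired = now_ns <= cap.expires_at_ns.saturating_add(EXPIRY_GRACE_NS);
    let has_perms = cap.permissions & required == required;
    not_expired && has_perms
}
```

The permission test requires every requested bit to be present, so a capability with a superset of the requested rights still authorizes the operation.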
47.9.4 Revocation
Capability revocation in a distributed system is harder than on a single node because capabilities may be cached on remote nodes that are temporarily unreachable.
Single-node revocation (Section 11.1): Generation-based. O(1), no table scanning.
Distributed revocation protocol:
- Initiation: Home node increments the object's generation in the home directory.
- Broadcast: Home node sends `CapRevoke { object_id, new_generation }` to all nodes holding capabilities for this object. The grant log tracks which nodes are affected — only those nodes receive the broadcast.
- Acknowledgment: Each remote node marks matching capabilities invalid, responds with `CapRevokeAck`.
- Stale rejection: If a remote node presents an old-generation capability to the home node before receiving the broadcast, the home node rejects it immediately.
- Partition handling: If a node is unreachable, revocation messages are queued in a durable per-node outbox. When the partition heals, queued revocations replay in order. During the partition, stale capabilities work for local cached reads but fail for any operation requiring home-node validation.
Consistency guarantee: Revocation is eventually consistent. After broadcast completes and all nodes acknowledge, the capability is universally invalid. During the broadcast window (typically <1ms on RDMA), stale capabilities may succeed for local cached reads but fail for home-node operations.
RDMA optimization: Broadcasts use RDMA SEND over reliable connected (RC) queue pairs for guaranteed delivery and ordering.
Interaction with expiry: Distributed capabilities have bounded lifetimes (default: 5 minutes). Revocation is the fast path; expiry is the safety net — even if revocation messages are permanently lost, no stale capability survives beyond its expiry window.
Scalability: The broadcast is O(N) where N = nodes holding capabilities for the specific object (typically 2-5), not the cluster size.
47.9.5 Use Case: Remote GPU Access
Process on Node A wants to submit work to GPU on Node B:
1. Process has local capability: ACCEL_COMPUTE for GPU on Node B.
2. Kernel derives DistributedCapability, signs with Node A's key.
3. Kernel sends command submission + capability to Node B via RDMA.
4. Node B's kernel verifies capability:
- Signature valid (Node A's key)
- Not expired
- Not revoked
- Has ACCEL_COMPUTE permission for this GPU
5. Node B's AccelScheduler accepts the submission.
6. Completion notification sent back to Node A via RDMA.
The process on Node A uses the same AccelContext API
as for a local GPU. The kernel handles the distribution.
47.10 Distributed Device Fabric
47.10.1 Remote Device Access
The KABI vtable model (Section 7) naturally extends to remote devices. A device on Node B can be accessed from Node A through a proxy driver:
Local device access (existing):
Process → syscall → ISLE Core → KABI vtable call → Driver → Device
Remote device access (new):
Process → syscall → ISLE Core → KABI vtable call → RDMA Proxy Driver
→ RDMA transport → Node B ISLE Core → KABI vtable call → Driver → Device
The proxy driver implements the same KABI vtable as the real driver.
It translates vtable calls into RDMA messages. ISLE Core cannot tell
the difference between a local driver and a proxy driver.
47.10.2 Proxy Driver
// isle-core/src/distributed/proxy.rs
/// A proxy driver that forwards KABI vtable calls to a remote node.
/// Appears as a normal driver to the local kernel.
pub struct RemoteDeviceProxy {
/// Remote node hosting the actual device.
remote_node: NodeId,
/// Remote device ID (on the remote node's device registry).
remote_device_id: DeviceNodeId,
/// RDMA connection to remote node.
transport: KernelTransport,
/// Cached device info (refreshed periodically).
cached_info: AccelDeviceInfo,
/// Capability header for accessing the remote device.
/// The associated CapabilitySignature is slab-allocated and referenced
/// via the header's signature pointer.
remote_cap: CapabilityHeader,
/// Outstanding remote calls (for timeout/cancellation).
/// Uses a slab allocator (not a BTreeMap) for O(1) insertion and lookup by
/// call ID: the call ID encodes the slab slot index directly, so no tree
/// traversal is needed. The slab is bounded by the per-proxy in-flight limit
/// (max_inflight from NodeConnection), which is fixed at connection setup time.
pending_calls: SlabMap<PendingRemoteCall>,
}
This enables:
| Scenario | How It Works |
|---|---|
| Node A uses GPU on Node B | AccelBase proxy over RDMA. Command buffers sent via RDMA Write. |
| Node A reads NVMe on Node B | Block I/O proxy over RDMA (equivalent to NVMe-oF, but kernel-native). |
| Node A uses FPGA on Node B | AccelBase proxy. Same vtable, different device class. |
| Kubernetes GPU sharing | Kubelet on Node A schedules pod needing GPU. GPU is on Node B. Kernel-transparent. |
47.10.3 GPUDirect RDMA Across Nodes
For GPU-to-GPU communication across nodes (essential for distributed training):
Current state (NCCL on Linux):
GPU 0 (Node A) → PCIe → CPU RAM (Node A) → RDMA NIC → Network →
→ RDMA NIC → CPU RAM (Node B) → PCIe → GPU 0 (Node B)
Copies: 2 (GPU→CPU, CPU→GPU). CPU involvement: yes.
With GPUDirect RDMA (supported by Mellanox NICs + NVIDIA GPUs):
GPU 0 (Node A) → PCIe → RDMA NIC → Network →
→ RDMA NIC → PCIe → GPU 0 (Node B)
Copies: 0. CPU involvement: none.
ISLE integration:
- P2P DMA (Section 43) handles local GPU↔NIC path
- KernelTransport handles the RDMA portion
- RdmaDeviceVTable.register_device_mr() (Section 46.1.3) registers GPU VRAM for RDMA
- Combined path: GPU→NIC→Network→NIC→GPU, zero CPU copies
The kernel manages the IOMMU mappings on both ends, ensuring that:
- GPU VRAM is registered as an RDMA memory region
- RDMA NIC has IOMMU permission to DMA to/from GPU BAR
- Remote node's RDMA NIC has permission via remote rkey
- Capability system authorizes the cross-node GPU-to-GPU transfer
47.11 Failure Handling and Split-Brain
47.11.1 Failure Model
Distributed systems have failure modes that single-machine kernels don't:
| Failure | Detection | Recovery |
|---|---|---|
| Node crash (power loss) | Heartbeat timeout (Suspect at 300ms, Dead at 1000ms per Section 47.11.2) | Reclaim resources, invalidate capabilities |
| Network partition | Heartbeat timeout + asymmetric | Split-brain protocol (Section 47.11.3) |
| RDMA NIC failure | Link-down event + failed RDMA ops | Fallback to TCP or isolate node |
| Slow node (Byzantine) | Heartbeat latency spike | Mark suspect, reduce trust |
| Storage failure | I/O error from block driver | FMA-managed (Section 39) |
47.11.2 Heartbeat Protocol
// isle-core/src/distributed/heartbeat.rs
pub struct HeartbeatConfig {
/// Heartbeat interval.
/// Default: 100ms (10 heartbeats/sec).
pub interval_ms: u32,
/// Miss count before marking node Suspect.
/// Default: 3 (300ms of silence → Suspect).
pub suspect_threshold: u32,
/// Miss count before marking node Dead.
/// Default: 10 (1000ms of silence → Dead).
pub dead_threshold: u32,
/// Heartbeat uses RDMA Send (two-sided) not RDMA Write (one-sided)
/// because we need the remote CPU to respond (proof of liveness).
pub transport: HeartbeatTransport,
}
// The heartbeat sender and receiver threads run at SCHED_FIFO priority
// (configurable, default priority 50) to avoid false suspect transitions
// caused by CPU saturation delaying heartbeat processing. For non-RDMA
// (TCP) clusters where network latency is higher and more variable, the
// recommended defaults are: interval_ms=500, suspect_threshold=3 (1500ms),
// dead_threshold=10 (5000ms).
/// Heartbeat message (sent via RDMA Send, 64 bytes).
#[repr(C)]
pub struct HeartbeatMessage {
/// Sender node ID.
pub node_id: NodeId, // 4 bytes
pub _pad0: u32, // 4 bytes — explicit alignment padding for generation
/// Monotonic generation (incremented on node restart).
/// If generation changes, it means the node rebooted.
pub generation: u64,
/// Sender's current timestamp (for clock skew estimation).
pub timestamp_ns: u64,
/// Sender's load summary (for cluster scheduler).
pub cpu_percent: u32,
pub memory_pressure: u32,
pub accel_percent: u32,
pub _pad1: u32, // 4 bytes — explicit alignment padding for membership_view
/// Sender's view of cluster membership (u64 bitfield, MAX_CLUSTER_NODES = 64).
pub membership_view: u64,
/// Explicit padding to bring total struct size to 64 bytes (one cache line).
/// Layout: node_id(4) + _pad0(4) + generation(8) + timestamp_ns(8) +
/// cpu_percent(4) + memory_pressure(4) + accel_percent(4) + _pad1(4) +
/// membership_view(8) + _pad(16) = 64.
pub _pad: [u8; 16],
}
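The padding arithmetic in the layout comment can be verified mechanically with `size_of`. A standalone mirror of the struct, assuming `NodeId` is a 4-byte `u32` as the field comment indicates:

```rust
pub type NodeId = u32; // 4 bytes, per the field comment above

/// Mirror of HeartbeatMessage for layout verification only.
#[repr(C)]
pub struct HeartbeatMessage {
    pub node_id: NodeId,
    pub _pad0: u32,
    pub generation: u64,
    pub timestamp_ns: u64,
    pub cpu_percent: u32,
    pub memory_pressure: u32,
    pub accel_percent: u32,
    pub _pad1: u32,
    pub membership_view: u64,
    pub _pad: [u8; 16],
}
```

With the explicit `_pad0`/`_pad1` fields, every u64 lands on an 8-byte boundary and the struct packs into exactly one 64-byte cache line with no compiler-inserted padding.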
47.11.3 Split-Brain Resolution
Network partition can cause split-brain: two groups of nodes each believe the other group has failed.
Strategy: Majority quorum + lease-based fencing tokens.
Cluster: Nodes {A, B, C, D, E} (5 nodes)
Partition: {A, B, C} can talk to each other. {D, E} can talk to each other.
Neither group can reach the other.
Resolution:
1. Each group counts its members.
2. {A, B, C} has 3/5 = majority. Continues operating.
3. {D, E} has 2/5 = minority. Enters read-only mode.
- No new DSM writes (avoids conflicting updates).
- No process migrations.
- Local workloads continue (processes already on D, E keep running).
- Remote page faults that target {A, B, C} get errors.
4. When partition heals:
- {D, E} rejoin the cluster.
- DSM directory entries are reconciled (version numbers resolve conflicts).
- Process affinity recalculated.
**Lease-Based Fencing Tokens**: To prevent split-brain ambiguity, ISLE uses
monotonically-increasing fencing tokens (lease epochs) for all cluster-wide
operations:
```rust
/// Fencing token for split-brain prevention.
/// Monotonically increments on every quorum leadership change.
/// Used to invalidate stale operations from partitioned minorities.
#[repr(C)]
pub struct FencingToken {
/// Monotonically increasing epoch.
/// Incremented when:
/// 1. Cluster membership changes (node join/leave)
/// 2. Quorum leader changes (leader failure)
/// 3. Admin triggers manual fencing (maintenance)
pub epoch: u64,
/// Node ID of the quorum leader that issued this token.
/// Used for token validation and leader identification.
pub leader_node: NodeId,
/// Timestamp when this token was issued (cluster-relative, PTP-synchronized).
/// Tokens expire after FENCING_TOKEN_TTL_NS (default: 30 seconds).
pub issued_at_ns: u64,
}
/// Token time-to-live: tokens older than this are considered invalid.
/// Must be longer than the worst-case partition detection time (heartbeat timeout).
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds
```
Token propagation protocol:
- **Leader election**: When cluster membership changes, the new quorum leader
  (determined by the deterministic rules below) increments the fencing epoch and
  broadcasts the new `FencingToken` to all nodes in its partition.
- **Token validation**: Every DSM write and capability grant includes the sender's
  current fencing token. The receiver rejects operations with stale tokens
  (`token.epoch < local_token.epoch`) or expired tokens
  (`now - token.issued_at_ns > FENCING_TOKEN_TTL_NS`).
- **Partition healing**: When partitions heal, the surviving leader's fencing token
  takes precedence. Nodes from the minority partition discard their stale tokens and
  adopt the majority's token before resuming normal operations.
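The validation rule can be expressed as a pure function. A minimal sketch under the definitions above; the `TokenError` type is illustrative, not the kernel's actual error enum:

```rust
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds

pub struct FencingToken {
    pub epoch: u64,
    pub leader_node: u32,
    pub issued_at_ns: u64,
}

#[derive(Debug, PartialEq)]
pub enum TokenError {
    StaleEpoch, // token.epoch < local_token.epoch
    Expired,    // now - issued_at_ns > FENCING_TOKEN_TTL_NS
}

/// Accept or reject an incoming operation based on its fencing token.
pub fn validate_token(
    token: &FencingToken,
    local: &FencingToken,
    now_ns: u64,
) -> Result<(), TokenError> {
    if token.epoch < local.epoch {
        return Err(TokenError::StaleEpoch);
    }
    if now_ns.saturating_sub(token.issued_at_ns) > FENCING_TOKEN_TTL_NS {
        return Err(TokenError::Expired);
    }
    Ok(())
}
```

A token from a partitioned minority carries an old epoch and is rejected as `StaleEpoch` even if it has not yet expired; the TTL catches the complementary case of a token issued before a leader was silently fenced off.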
**Deterministic Tie-Breaking Rules**: To ensure all nodes independently reach the same decision during a partition, ISLE applies the following rules in strict precedence order (Rule 1 > Rule 3 > Rule 2 > Rule 2a):
- **Rule 1: Majority Wins** — The partition with >50% of nodes wins.
  - 5-node cluster: 3+ nodes win.
  - 6-node cluster: 4+ nodes win.
- **Rule 3: External Witness (optional)** — An admin-configured external witness (node ID 255, or a dedicated witness VM) acts as a tiebreaker when neither side holds a majority. If configured, the partition containing the witness wins, overriding Rules 2/2a.
- **Rule 2: Larger Partition Wins** — If neither partition holds a strict majority (possible when some nodes are unreachable from both sides), the partition with more nodes wins.
- **Rule 2a: Equal Size → Higher Node ID Sum Wins** — If two partitions have exactly the same number of nodes (possible only in even-sized clusters), the partition with the higher total node ID sum wins. If the sums are also equal, the partition whose lowest node ID is smaller wins.
  - Example: In a 4-node cluster {A=1, B=2, C=3, D=4}, partitions {A, D} (sum=5) and {B, C} (sum=5) tie on sum, so the minimum-ID tiebreaker applies: {A, D} has min=1, {B, C} has min=2, so {A, D} wins.
  - This provides a total order over all possible equal-size partitions without requiring external coordination.
Implementation for 64-node clusters with fixed-size bitmasks:
```rust
/// Deterministically select the winning partition from two candidates.
/// Returns true if partition_a wins over partition_b.
///
/// # Preconditions
/// - Partitions are disjoint (no node in both)
/// - Neither partition is empty
pub fn partition_wins(partition_a: u64, partition_b: u64, cluster_size: u8,
                      witness_mask: u64) -> bool {
    let count_a = partition_a.count_ones();
    let count_b = partition_b.count_ones();
    // Rule 1: Strict majority wins (floor(n/2) + 1 nodes).
    let majority_threshold = cluster_size as u32 / 2 + 1;
    if count_a >= majority_threshold { return true; }
    if count_b >= majority_threshold { return false; }
    // Rule 3: External witness (if configured)
    if witness_mask != 0 {
        let witness_in_a = (partition_a & witness_mask) != 0;
        let witness_in_b = (partition_b & witness_mask) != 0;
        if witness_in_a && !witness_in_b { return true; }
        if witness_in_b && !witness_in_a { return false; }
    }
    // Rule 2: Larger partition wins
    if count_a != count_b { return count_a > count_b; }
    // Rule 2a: Equal size → higher node ID sum wins
    let sum_a = node_id_sum(partition_a);
    let sum_b = node_id_sum(partition_b);
    if sum_a != sum_b { return sum_a > sum_b; }
    // Rule 2a final tiebreaker: lower minimum node ID wins
    // (guaranteed to differ for disjoint partitions of equal size)
    min_node_id(partition_a) < min_node_id(partition_b)
}

/// Compute sum of node IDs in a partition bitmask.
fn node_id_sum(partition: u64) -> u32 {
    let mut sum = 0u32;
    let mut mask = partition;
    while mask != 0 {
        let lowest = mask.trailing_zeros();
        sum += lowest + 1; // node IDs are 1-indexed
        mask &= !(1u64 << lowest);
    }
    sum
}

/// Find the minimum node ID in a partition bitmask.
fn min_node_id(partition: u64) -> u32 {
    partition.trailing_zeros() + 1 // node IDs are 1-indexed
}
```
**Raft/Paxos for Critical Cluster Metadata**: For cluster-critical state that cannot tolerate any inconsistency (security policies, node authentication keys, cluster configuration), ISLE uses Raft consensus rather than the simpler majority-quorum protocol:
- Raft scope: Security policy replication, node certificate revocation, cluster-wide configuration changes (adding/removing nodes).
- Quorum protocol scope: DSM page ownership, capability caching, heartbeat.
- Rationale: Raft provides linearizability but requires persistent logging and leader election overhead. Using it only for critical metadata avoids imposing Raft's cost on the high-frequency DSM operations.
The Raft implementation is a simplified single-term leader election with
log replication, managed by the ClusterMetadataReplicator service. Leader
election uses the same deterministic rules above to avoid split-brain during
Raft leader selection.
In-flight RDMA operations during partition: When a network partition occurs, RDMA
NICs report completion with error status for all in-flight operations. KernelTransport
translates these to TransportError::ConnectionLost. Callers (DSM fault handler, IPC
ring) retry once (in case of transient link flap), then return an error to the process:
SIGBUS for DSM page faults, EIO for IPC operations.
Minority partition DSM behavior: Nodes in the minority partition mark all DSM pages
as SUSPECT. Write accesses to SUSPECT pages are blocked (write-protect in page tables);
write attempts fault and return EAGAIN. Read accesses to SUSPECT pages continue to
return locally-cached data (preserving availability for read-heavy workloads during
brief partitions), but set a per-page stale_read flag. Applications can check whether
they have read potentially stale data via madvise(MADV_DSM_CHECK), which returns
EDSM_STALE if any SUSPECT pages were read since the last check. This ensures that
stale reads are never silent — applications that care about consistency can detect and
handle them, while applications that tolerate staleness continue without interruption.
When the partition heals, SUSPECT pages are reconciled with the majority partition's
directory and the SUSPECT marking is cleared.
Unreachable home node: Each DSM page has two home nodes: primary (determined by
hash(region, VA) % cluster_size) and backup (determined by hash_backup(region, VA)
% cluster_size using a different hash seed, guaranteed to differ from the primary).
Backup home node protocol:
1. Shadow directory maintenance: On every directory state change (ownership
transfer, reader set update), the primary home node sends the updated
DsmDirectoryEntry to the backup via RDMA Write to a pre-allocated shadow
directory region on the backup node. Each entry includes a generation counter
(incremented on every update) for consistency verification.
2. Consistency: The backup's shadow directory is write-only from the primary's
perspective — the backup never modifies it independently. The generation counter
ensures that stale writes (e.g., reordered RDMA Writes) are detected and
discarded. The backup compares the incoming generation counter against its last
seen value; only strictly incrementing updates are applied.
3. Failover: When the primary home node is declared Dead by the cluster
membership protocol (Section 47.11), the new home node is determined
deterministically — not by self-promotion. The membership protocol's
NodeDead event triggers re-evaluation of the same hashing rule used for
initial home placement (Section 47.5.3): hash(region, VA) % new_cluster_size
with the dead node removed from the membership set. All surviving nodes
compute the same result, producing a single deterministic new home. No node
self-promotes. The new home is computed deterministically from the membership
epoch, so all survivors agree.
If the deterministic new home happens to be the node already holding the
backup shadow directory, it promotes the shadow to primary. Otherwise, the
backup transfers its shadow entries to the computed new home. Directory
entries on the new home may be slightly stale (by at most one in-flight
update). The version counter in each DsmDirectoryEntry allows requestors
to detect and retry if they encounter a stale entry.
4. Partition healing: When the primary returns, the primary and backup reconcile
their directories: for each entry, the node with the higher generation counter
wins. The backup reverts to shadow mode after reconciliation completes.
For even-numbered clusters (no strict majority possible):
- Admin designates a "tiebreaker" node (or external witness).
- Or: the smaller-numbered-node-set wins (deterministic, no external dependency).

Caveat: the smaller-numbered-node-set heuristic is simple but has a known weakness — if the lower-numbered nodes are physically co-located, a power event affecting that rack consistently picks the wrong survivor set. For production deployments, an external quorum device (dedicated witness VM, or a third-site arbitrator via RDMA or TCP heartbeat) is recommended. The heuristic remains the default fallback when no external witness is configured.
#### 47.11.4 DSM Recovery After Node Failure
Node B fails (holds exclusive ownership of some DSM pages):
- Heartbeat timeout → Node B marked Dead.
- For each DSM page where B was owner:
  a. The home node (determined by hash) still has the directory entry.
  b. If B was SharedOwner: readers still have valid copies.
     → Promote one reader to owner (pick the one closest to the home node).
  c. If B was Exclusive: the page data is LOST (the only copy was on B).
     → Mark the page as "lost." Processes faulting on it get SIGBUS.
     → The application must handle this (checkpoint/restart).
- For each DSM page where B was a reader:
  a. Simply remove B from the reader set. No data loss.
- Capabilities issued by B expire naturally (bounded lifetime). Remote nodes stop accepting B-signed capabilities immediately.
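The per-page recovery decision is a match over the dead node's role for that page. A minimal sketch; the type names and the first-reader promotion policy shown here are illustrative (the text says "closest to home node", which this sketch does not model):

```rust
#[derive(Debug, PartialEq)]
pub enum OwnerRole { SharedOwner, Exclusive, Reader }

#[derive(Debug, PartialEq)]
pub enum RecoveryAction {
    /// Readers still hold valid copies; promote one to owner.
    PromoteReader { new_owner: u32 },
    /// Only copy died with the node; subsequent faults get SIGBUS.
    MarkLost,
    /// Dead node was just a reader; no data loss.
    DropFromReaderSet,
}

/// Decide the recovery action for one page after node B dies.
/// `readers` is the surviving reader set (non-empty iff B was SharedOwner).
pub fn recover_page(role: OwnerRole, readers: &[u32]) -> RecoveryAction {
    match role {
        OwnerRole::SharedOwner => RecoveryAction::PromoteReader { new_owner: readers[0] },
        OwnerRole::Exclusive => RecoveryAction::MarkLost,
        OwnerRole::Reader => RecoveryAction::DropFromReaderSet,
    }
}
```

Only the Exclusive case loses data, which is exactly what the `DSM_REPLICATE` mitigation below addresses.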
Mitigation for exclusive page loss:
- DSM regions can be created with DSM_REPLICATE flag (replication_factor = 2).
- Every write to an exclusive page is mirrored to a backup node.
- On failure: backup is promoted to owner. No data loss.
- Cost: 2x write bandwidth for replicated regions.
**Design rationale — why replication is NOT the default:**
DSM is a **performance optimization**, not a durability mechanism. The default
(`replication_factor = 1`) is correct for the typical DSM use case:
1. **Ephemeral data**: Caches, temporary buffers, computational scratch space. Loss on
node failure is acceptable — the data can be recomputed or reloaded from source.
2. **Read-mostly workloads**: Configuration data, reference tables, shared code pages.
These are typically backed by persistent storage; losing the in-memory copy just
means reloading from disk.
3. **Application-managed durability**: Databases, distributed file systems, and message
queues implement their own replication and checkpointing. Adding DSM-level
replication would be redundant and wasteful.
4. **Performance sensitivity**: The 2x bandwidth cost and ~15% latency overhead of
synchronous replication would penalize all DSM users, even those who don't need it.
**When to enable replication:**
- DSM regions holding irreplaceable data without application-level persistence
- Workloads where recomputation cost exceeds replication cost
- Environments where node failure is frequent enough to justify the overhead
**Durability is the application's responsibility**: Just as applications using `mmap()`
or `malloc()` must implement their own persistence, DSM users must decide whether
their data warrants replication. The kernel provides the mechanism (`DSM_REPLICATE`);
the policy is left to the application.
#### 47.11.5 Clock Synchronization
Distributed capabilities (Section 47.9) and DSM timeouts rely on nodes having
synchronized clocks. ISLE requires PTP (IEEE 1588 Precision Time Protocol) as
the primary clock synchronization mechanism:
- **PTP grandmaster**: One node (or a dedicated PTP appliance) serves as the time
reference. All other nodes synchronize to it via hardware PTP timestamping on the
RDMA NIC.
- **Expected accuracy**: <1 μs with hardware PTP (typical for modern RDMA NICs with
PTP hardware timestamping support).
- **NTP fallback**: If PTP is not available (no hardware support), NTP is used as a
fallback. Expected accuracy: 1-10 ms. When using NTP, capability expiry grace period
is increased to 100ms (from the default 1ms with PTP).
- **Maximum acceptable skew**: 1ms for PTP deployments. Capability expiry includes a
1ms grace period to account for this skew. Nodes with clock skew exceeding 10ms
trigger an FMA alert (reported as a `HealthEvent` with
`class: HealthEventClass::Network`, event code `CLOCK_SKEW_EXCEEDED` — clock skew is
a network-level health event, not its own event class).
- **Clock skew estimation**: Each heartbeat message (Section 47.11.2) includes the
  sender's timestamp. The receiver estimates one-way clock skew as
  `remote_ts + RTT/2 - local_ts` (zero when the clocks agree; positive when the
  remote clock runs ahead). Persistent skew > 1ms triggers a PTP resynchronization.
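The skew estimate from the last bullet can be sketched directly. Signed arithmetic is needed because the remote clock may be behind; the function name is illustrative:

```rust
/// One-way clock skew estimate from a heartbeat exchange.
/// `remote_ts` is the sender's timestamp_ns, `local_ts` the receive time
/// on the local clock, `rtt_ns` the measured round trip.
/// Returns nanoseconds; positive means the remote clock runs ahead.
pub fn estimate_skew_ns(remote_ts: u64, local_ts: u64, rtt_ns: u64) -> i64 {
    // With perfect clocks, local_ts = remote_ts + rtt/2, so this is zero.
    remote_ts as i64 + (rtt_ns / 2) as i64 - local_ts as i64
}
```

The RTT/2 term assumes a symmetric network path; asymmetric routes bias the estimate, which is one reason hardware PTP (which timestamps at the NIC) is preferred over this heartbeat-based estimate for actual synchronization.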
---
### 47.12 CXL 3.0 Fabric Integration
#### 47.12.1 Why CXL Changes Everything
CXL (Compute Express Link) 3.0 provides **hardware-coherent** shared memory across
a PCIe fabric. Unlike RDMA (which requires software coherence protocols), CXL memory
is accessed via normal CPU load/store instructions with hardware cache coherence.
> **Hardware availability caveat**: CXL 3.0 hardware with full shared memory semantics
> (back-invalidate snooping for multi-host coherence) is not yet commercially available
> as of 2025. ISLE's CXL support targets CXL 2.0 Type 3 devices (available in Sapphire
> Rapids, Genoa) with software-managed coherence. CXL 3.0 back-invalidate snooping will
> be supported when hardware becomes available. The CXL 3.0 sections below describe the
> target architecture; implementation is gated on hardware availability.
Memory access latency spectrum:
Local DRAM (same socket):    ~50-80 ns     (load/store)
Local DRAM (cross-socket):   ~100-150 ns   (load/store, QPI/UPI)
CXL 2.0 attached memory:     ~200-400 ns   (load/store, PCIe + CXL)
CXL 3.0 shared memory pool:  ~200-500 ns   (load/store, coherent)
RDMA:                        ~2000-5000 ns (explicit transfer)
NVMe SSD:                    ~10000 ns     (block I/O)
CXL 3.0 shared memory is 5-25x faster than RDMA because:
1. No software protocol (hardware coherence via CXL.cache, CXL.mem)
2. Cache-line granularity (64 bytes, not 4KB pages)
3. No memory registration overhead
4. CPU load/store instructions, not DMA engine
#### 47.12.2 Design: CXL as a First-Class Memory Tier
`PageLocation` (defined canonically in Section 43.1.5 of 11-accelerators.md,
reproduced in Section 47.5.5 above) is reused here for distributed page placement
decisions. See Section 47.5.5 for the full enum definition including RDMA-specific
variants.
```rust
// Extend NumaNodeType (from Section 43.1.3)
pub enum NumaNodeType {
CpuMemory,
AcceleratorMemory { /* ... */ },
CxlMemory {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
},
// NEW: CXL 3.0 shared memory pool (visible to multiple nodes)
CxlSharedPool {
/// Latency from this CPU to the shared pool.
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
/// Nodes that share this pool (u64 bitfield, MAX_CLUSTER_NODES = 64).
sharing_nodes: u64,
/// Hardware coherence protocol version.
coherence_version: CxlCoherenceVersion,
},
}
#[repr(u32)]
pub enum CxlCoherenceVersion {
/// CXL 2.0: pooled memory, no hardware coherence between hosts.
/// Kernel manages coherence via software protocol (like DSM Section 47.5).
Cxl20Pooled = 0,
/// CXL 3.0: hardware-coherent shared memory.
/// CPU cache coherence protocol extended across CXL fabric.
/// No software coherence needed — hardware handles it.
Cxl30Coherent = 1,
}
```
#### 47.12.3 CXL + RDMA Hybrid
In a realistic datacenter, both CXL and RDMA will coexist:
Rack-level (CXL 3.0 fabric, ~200-500 ns):
┌─────────┐ CXL ┌─────────┐ CXL ┌─────────┐
│ Node 0 │◄────────►│ CXL │◄────────►│ Node 1 │
│ CPU+GPU │ │ Switch │ │ CPU+GPU │
└─────────┘ │ +Memory │ └─────────┘
│ Pool │
└────┬────┘
│ CXL
┌────┴────┐
│ Node 2 │
│ CPU+GPU │
└─────────┘
Cross-rack (RDMA, ~2-5 μs):
Rack 0 ◄──── 400GbE RDMA ────► Rack 1
Memory tier ordering within this topology:
Tier 1: Local DRAM (~80 ns)
Tier 2: CXL shared pool (~300 ns) ← same rack
Tier 3: GPU VRAM (~500 ns)
Tier 4: Compressed (~1-2 μs to decompress)
Tier 5: Remote DRAM via RDMA (~3 μs) ← cross rack
Tier 6: Local NVMe (~12 μs)
Tier 7: Remote NVMe via NVMe-oF/RDMA ← cross rack
The kernel detects CXL and RDMA links automatically (device registry)
and builds the distance matrix accordingly. No manual configuration
of tier ordering — the measured latencies determine placement policy.
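Because tier order is derived from measurement rather than configuration, the placement policy reduces to sorting tiers by observed latency. A minimal sketch, with numbers taken from the tier list above (the function name is illustrative):

```rust
/// Order memory tiers by measured access latency.
/// The distance matrix supplies the latencies at boot; nothing is
/// hard-coded, so new link types (CXL, RDMA) slot in automatically.
pub fn order_tiers(mut tiers: Vec<(&'static str, u64)>) -> Vec<&'static str> {
    tiers.sort_by_key(|&(_, latency_ns)| latency_ns);
    tiers.into_iter().map(|(name, _)| name).collect()
}
```

A rack with a CXL pool measures it at ~300 ns and places it above GPU VRAM and remote DRAM; a rack without CXL simply never sees that entry, and the ordering degrades gracefully to the RDMA-only hierarchy.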
#### 47.12.4 CXL Shared Memory for DSM
When CXL 3.0 hardware-coherent shared memory is available, the DSM protocol (Section 47.5) simplifies dramatically:
DSM over RDMA (software coherence):
- Page fault → directory lookup → ownership transfer → RDMA page transfer
- ~10-25 μs per fault
- Software TLB invalidation protocol
DSM over CXL 3.0 (hardware coherence):
- Map CXL shared pool pages into process address space
- CPU load/store works directly (hardware coherence)
- No page faults for coherence (hardware handles cache-line invalidation)
- ~200-500 ns access latency (same as accessing CXL memory)
- Software DSM protocol not needed for CXL-connected nodes
The kernel uses CXL shared memory when available (intra-rack),
and falls back to RDMA-based DSM for cross-rack communication.
Best transport is selected automatically per page.
DSM redundancy and DLM: For node pairs connected via a CXL 3.0 fabric, the
software DSM page-ownership state machine (§47.5.2) is redundant — the hardware
handles cache-line-granularity coherence without software page faults or ownership
transfers. ISLE's DSM routing layer uses CxlPool transport (load/store) for these
pairs and skips the full coherence protocol.
However, the Distributed Lock Manager (DLM, §47.6) remains fully required even
with CXL 3.0. Hardware cache coherence does not provide mutual exclusion semantics
for arbitrary data structures (spinlocks, reader-writer locks, cross-node transactions).
DLM lock acquisition (LOCK_ACQUIRE, atomic CAS on lock words) continues to operate
as specified — CXL just makes the lock words accessible via load/store rather than
RDMA, which reduces lock round-trip latency but does not eliminate the need for the
protocol.
CXL 3.0 node pair — what changes vs. RDMA:
DSM page faults: eliminated (hardware coherence)
DSM ownership xfer: eliminated (no protocol needed)
DLM lock acquire: unchanged in protocol, load/store instead of RDMA
DLM deadlock detect: unchanged
Heartbeat / membership: unchanged
Capability tokens: unchanged
#### 47.12.5 CXL Devices as ISLE Peers
The three CXL device types map to distinct ISLE peer operating models. The classification determines Mode A vs. Mode B, peer role (compute vs. memory-manager), and crash recovery behavior.
##### 47.12.5.1 Type 1: Coherent Compute (CXL.cache only)
Type 1 devices have their own compute and a cache that participates in the host CPU's
coherency domain via CXL.cache. They have no device-managed DRAM visible via CXL.
ISLE operating model:
- Mode B peer (hardware-coherent transport) — CXL.cache IS the cache coherence
protocol. Ring buffers placed in the shared region are coherent by hardware without
any explicit cache flush or ownership protocol in software.
- Full compute peer — runs ISLE or a firmware shim (Paths A/B). Participates in
cluster membership, heartbeat, DLM, and DSM as a regular cluster node.
- FLR cache flush requirement — see §47.2.5.2. On FLR the device must flush all
CXL.cache dirty lines to host memory. Host waits for FLR completion before
accessing the shared region.
##### 47.12.5.2 Type 2: Coherent Compute + Device Memory (CXL.cache + CXL.mem)
Type 2 is the richest CXL peer type. The device has both a coherent cache
(CXL.cache) AND device-local DRAM accessible to the host via load/store
(CXL.mem).
ISLE operating model:
- Mode B peer with bidirectional zero-copy — coherent in both directions:
device cache participates in host coherency domain (CXL.cache); host CPU can
directly load/store device DRAM as a NUMA memory tier (CXL.mem). Neither
direction requires DMA or explicit transfer.
- Memory tier AND compute peer — the device's DRAM is registered as
NumaNodeType::CxlMemory (or CxlSharedPool if multi-host). The device's
cores run ISLE and execute workloads. Same ClusterNode participates in both
memory placement decisions and compute scheduling.
- Ring buffers — can live in either host DRAM (coherent via CXL.cache) or
device DRAM (coherent via CXL.mem + host load/store). Placement policy: ring
buffers go in the lower-latency region as measured at runtime.
- FLR cache flush — same requirement as Type 1.
- Examples: future CXL-attached GPUs with HBM, CXL AI accelerators.
##### 47.12.5.3 Type 3: Memory Expander (CXL.mem only, minimal compute)
Type 3 devices provide DRAM accessible to the host via CXL.mem. They have no
accelerator compute from CXL's perspective, though they may have a small management
processor (typically ARM Cortex-A5x or similar).
ISLE operating model — memory-manager peer, not compute peer:
- The management processor runs a minimal ISLE instance (Path A via AArch64 build target) with a single responsibility: managing the DRAM pool.
- What it does: monitors pool health, handles tiering decisions (hot/cold page migration across CXL memory sub-regions), manages optional compression or encryption, retires bad pages, reports ECC errors and media errors to the host cluster membership layer.
- What it does NOT do: run application workloads, participate in DSM ownership transfers (the pool appears as a NUMA node to the host, not as a peer's memory), or require the full cluster protocol. It uses a lightweight subset: heartbeat + health reporting + pool management messages.
- No DLM, no DSM page protocol: the Type 3 peer does not own pages in the distributed sense. The host NUMA allocator owns pages in the CXL pool; the management processor just monitors and maintains the physical medium.
CXL 3.0 multi-host pool: when a CXL switch connects multiple hosts to the same Type 3 pool, the management processor becomes a shared pool coordinator:
- Arbitrates allocation between hosts (each host has a capacity reservation)
- Reports pool-wide health events to all connected hosts
- Does not arbitrate cache coherence (hardware does that via CXL.cache back-invalidate)
- Does not run DLM for cross-host locking (ISLE DLM handles that over CXL.cache)
Crash recovery: see §47.2.5.5 CXL Type 3 section. DRAM persists; management
layer is lost until management processor recovers. Pool transitions to
ManagementFaulted state; uncompressed/unencrypted pages remain accessible.
##### 47.12.5.4 CXL Switch as Fabric Manager Node
CXL 3.0 introduces intelligent CXL switches that route traffic between multiple hosts and multiple memory/compute devices. A CXL switch with embedded compute (ARM or RISC-V management processor) can run ISLE as a fabric manager node:
- Topology discovery: the switch sees all CXL endpoints and can report the full fabric topology (which hosts can reach which memory pools, with latency measurements) to the ISLE cluster membership layer.
- Routing policy: the switch can enforce traffic shaping, QoS, or bandwidth partitioning between hosts sharing the same CXL fabric.
- No data plane participation: the switch does not own memory pages, run workloads, or participate in DLM. It is a topology and routing oracle.
- This is a future capability: CXL 3.0 switch hardware with embedded compute is not yet commercially available (2025). The architecture is ready; the FabricManager node type and associated cluster messages are deferred to Phase 5+ (§47.16).
##### 47.12.5.5 Summary Table
| CXL Type | Compute | Memory | ISLE Peer Role | Transport | Crash Model |
|---|---|---|---|---|---|
| Type 1 | Yes (CXL.cache) | No | Full compute peer | Mode B | FLR flush + standard recovery |
| Type 2 | Yes (CXL.cache) | Yes (CXL.mem) | Compute + memory tier peer | Mode B | FLR flush + standard recovery |
| Type 3 | No (mgmt only) | Yes (CXL.mem) | Memory-manager peer | Heartbeat + pool msgs | ManagementFaulted; DRAM persists |
| CXL Switch | Yes (mgmt only) | No | Fabric manager (future) | Topology msgs | Fallback to static topology |
### 47.13 Linux Compatibility and MPI Integration
#### 47.13.1 Existing RDMA Applications (Unchanged)
All existing Linux RDMA applications work through the compatibility layer:
| Application | Linux Interface | ISLE Path |
|---|---|---|
| MPI (OpenMPI, MPICH) | libibverbs / libfabric | isle-compat RDMA compat layer |
| NCCL (multi-node GPU) | libibverbs + GDR | RDMA compat + GPUDirect RDMA |
| DPDK | ibverbs / EFA | RDMA compat |
| Ceph (msgr2) | RDMA transport | RDMA compat |
| Spark (RDMA shuffle) | libfabric | RDMA compat |
| Redis (RDMA) | ibverbs | RDMA compat |
The compatibility layer (isle-compat/src/rdma/) translates Linux verbs API calls
to KABI RdmaDeviceVTable calls. Binary libibverbs.so works without recompilation.
#### 47.13.2 MPI Optimization Opportunities
MPI implementations can opt into ISLE-specific features for better performance:
Standard MPI on ISLE (no changes, works today):
MPI_Send/MPI_Recv → libibverbs → RDMA NIC
MPI_Win_create (shared memory window) → mmap + RDMA
Performance: same as on Linux
MPI on ISLE with DSM (opt-in, future):
MPI_Win_create → ISLE_SHM_MAKE_DISTRIBUTED
MPI_Put/MPI_Get → direct load/store on DSM region
Kernel handles page migration transparently.
No explicit RDMA operations needed by MPI implementation.
Benefit: MPI one-sided operations become load/store.
Latency: the first access to each page costs a DSM fault (~3-5 μs, comparable
to the ~2-5 μs overhead of an explicit RDMA verb), but subsequent accesses
cost ~50-150 ns because the page is now local.
For iterative algorithms (most HPC): the working set becomes local after the
first iteration, so subsequent iterations run at local memory speed.
#### 47.13.3 Kubernetes / Container Integration
/sys/fs/cgroup/<group>/cluster.nodes
# Which nodes this cgroup's processes can run on
# "0 1 2 3" or "all"
/sys/fs/cgroup/<group>/cluster.memory.remote.max
# Maximum remote memory for this cgroup
/sys/fs/cgroup/<group>/cluster.accel.devices
# Allowed accelerators (including remote)
# "node0:gpu0 node0:gpu1 node1:gpu0"
Kubernetes integration:
- kubelet reads cluster topology from /sys/kernel/isle/cluster/
- Device plugin exposes remote GPUs as schedulable resources
- Pod spec: resources.limits: { isle.dev/gpu: 4, isle.dev/remote-gpu: 4 }
- kubelet sets cgroup constraints; kernel handles placement
#### 47.13.4 ISLE-Specific Cluster Interfaces
/sys/kernel/isle/cluster/
nodes # List of cluster nodes with state
topology # Cluster distance matrix
dsm/
regions # Active DSM regions
stats # DSM page fault / migration stats
memory_pool/
total # Cluster-wide memory total
available # Cluster-wide memory available
per_node/ # Per-node breakdown
scheduler/
balance_interval_ms # Global load balance interval
migrations # Process migration count
migration_log # Recent migrations (for debugging)
capabilities/
revocation_list # Current revocation list
key_ring # Cluster node public keys
### 47.14 Integration with ISLE Architecture
#### 47.14.1 Memory Manager Integration
The distributed memory features integrate with the existing MM at two points:
1. Page fault handler (extend existing):
Current: fault → check VMA → allocate page / CoW / swap-in
Extended: fault → check VMA → check PageLocationTracker:
- CpuNode → standard local fault (unchanged)
- DeviceLocal → device fault (Section 42, unchanged)
- RemoteNode → RDMA fetch from remote node (NEW)
- CxlPool → CXL load (hardware handles it) (NEW)
- NotPresent → allocate locally (unchanged)
2. Page reclaim / eviction (extend existing):
Current: LRU scan → compress or swap
Extended: LRU scan → compress, OR migrate to remote node, OR swap
- If remote memory is available and faster than swap: migrate
- Decision based on cluster distance matrix + pool availability
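The reclaim decision in point 2 can be sketched as a latency comparison over the candidates. A minimal sketch; the type names and the compress-first policy shown are illustrative, not the kernel's actual reclaim code:

```rust
#[derive(Debug, PartialEq)]
pub enum EvictionTarget {
    Compress,        // keep locally in compressed tier
    RemoteNode(u32), // migrate to a remote pool node
    Swap,            // fall back to local swap
}

/// Choose where an evicted cold page goes.
/// `remote` is the best pool candidate as (node_id, measured latency in ns)
/// from the cluster distance matrix, or None if no node has spare capacity.
pub fn eviction_target(
    remote: Option<(u32, u64)>,
    swap_latency_ns: u64,
    compressible: bool,
) -> EvictionTarget {
    if compressible {
        // Compressed tier stays local and is cheapest to restore.
        return EvictionTarget::Compress;
    }
    match remote {
        // Remote memory wins only when it actually beats the swap path.
        Some((node, lat)) if lat < swap_latency_ns => EvictionTarget::RemoteNode(node),
        _ => EvictionTarget::Swap,
    }
}
```

With the numbers from the tier table (~3 μs RDMA vs ~12 μs NVMe swap), remote memory wins whenever a pool node has capacity; the comparison flips automatically on slow fabrics or a saturated pool.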
#### 47.14.2 Device Registry Integration
New device types in the registry:
ClusterFabric (virtual root for cluster topology)
+-- rdma_link_0 (Node 0 ↔ Node 1, 200 Gb/s, 2.5 μs RTT)
+-- rdma_link_1 (Node 0 ↔ Node 2, 200 Gb/s, 4.0 μs RTT)
+-- cxl_link_0 (Node 0 ↔ CXL Pool 0, 64 GB/s, 300 ns)
RemoteNode (virtual device representing a remote machine)
+-- Properties:
| node-id: 1
| state: "active"
| rtt-ns: 2500
| bandwidth-gbit-s: 200
| memory-total: 549755813888
| memory-available: 137438953472
| gpu-count: 4
+-- Services published:
"remote-memory" (GlobalMemoryPool)
"remote-accel" (RemoteDeviceProxy for each GPU)
"remote-block" (RemoteDeviceProxy for each NVMe)
#### 47.14.3 FMA Integration
New FMA health events for distributed subsystem (Section 39):
| Rule | Threshold | Action |
|---|---|---|
| RDMA link degraded | >10% packet retransmits / minute | Alert + reduce traffic |
| RDMA link down | Link-down event | Failover to TCP or isolate node |
| Remote node unresponsive | 3 missed heartbeats (300ms) | Mark Suspect |
| Remote node dead | 10 missed heartbeats (1000ms) | Mark Dead + reclaim |
| DSM page loss | Exclusive page on dead node | Alert + SIGBUS to process |
| Cluster split-brain | Membership views diverge | Quorum protocol (§47.11.3) |
| CXL memory error | Uncorrectable ECC on CXL pool | Migrate pages + Alert |
| Clock skew detected | >10ms drift between nodes | Alert (affects capability expiry) |
#### 47.14.4 Stable Tracepoints
New stable tracepoints for distributed observability (Section 40):
| Tracepoint | Arguments | Description |
|---|---|---|
| `isle_tp_stable_dsm_fault` | node_id, remote_node, vaddr, latency_ns | DSM page fault |
| `isle_tp_stable_dsm_migrate` | src_node, dst_node, pages, bytes | DSM page migration |
| `isle_tp_stable_dsm_invalidate` | owner_node, reader_nodes, vaddr | DSM cache invalidation |
| `isle_tp_stable_cluster_join` | node_id, rdma_gid | Node joined cluster |
| `isle_tp_stable_cluster_leave` | node_id, reason | Node left/failed |
| `isle_tp_stable_cluster_migrate` | pid, src_node, dst_node, reason | Process migration |
| `isle_tp_stable_rdma_transfer` | src_node, dst_node, bytes, latency_ns | RDMA data transfer |
| `isle_tp_stable_remote_fault` | node_id, tier, vaddr, latency_ns | Remote memory access |
| `isle_tp_stable_global_pool_alloc` | node_id, remote_node, bytes | Global pool allocation |
| `isle_tp_stable_global_pool_reclaim` | node_id, bytes, reason | Global pool reclaim |
#### 47.14.5 Object Namespace
Cluster objects in the unified namespace (Section 41):
\Cluster\
+-- Nodes\
| +-- node0\ (this machine)
| | +-- State "active"
| | +-- Memory "512 GB total, 384 GB free, 32 GB exported"
| | +-- GPUs → symlink to \Accelerators\
| +-- node1\
| +-- State "active"
| +-- Memory "512 GB total, 256 GB free, 64 GB exported"
| +-- RTT "2500 ns"
| +-- Bandwidth "200 Gb/s"
+-- DSM\
| +-- region_0\
| +-- Size "1073741824 (1 GB)"
| +-- Pages "262144 total, 131072 local, 131072 remote"
| +-- Faults "12345 total, 3.2 μs avg"
+-- MemoryPool\
| +-- Total "4096 GB (8 nodes × 512 GB)"
| +-- Available "2048 GB"
| +-- LocalExported "128 GB"
| +-- RemoteUsed "64 GB"
+-- Fabric\
+-- Links (RDMA link table with latency/bandwidth)
+-- Topology (switch-level fabric map)
Browsable via islefs:
cat /mnt/isle/Cluster/Nodes/node1/RTT
→ 2500 ns
ls /mnt/isle/Cluster/DSM/
→ region_0 region_1 region_2
47.14.6 Open Questions
The following items require further design work:
- DSM replication protocol: Full design for replication_factor > 1, including write propagation, quorum reads, and consistency guarantees.
- TCP fallback mechanism: When RDMA fails (NIC down, driver crash), transparently fall back to TCP sockets for kernel transport. Design must handle in-flight RDMA operations and re-establish connections.
- Scale extension beyond 64 nodes: Data structure changes for >64-node clusters (extended bitfields, hierarchical directories).
- io_uring migration: How to migrate io_uring instances during process migration (in-flight SQEs, registered buffers, registered files).
- TCP socket migration: How to migrate established TCP connections (sequence numbers, window state, congestion state). May require connection proxy on source node.
- Partial network partitions: Non-transitive reachability (A can reach B, B can reach C, but A cannot reach C). Current quorum model assumes transitive reachability.
47.15 Implementation Phasing
| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| KernelTransport (RDMA kernel API) | Phase 3 | RDMA driver KABI | Foundation for everything |
| KernelTransport (TCP fallback) | Phase 3+ | KernelTransport | TCP socket transport for RDMA-less environments; latency ~50-200μs vs ~2-3μs RDMA |
| Cluster join / topology discovery | Phase 3 | KernelTransport | Basic cluster formation |
| Heartbeat + failure detection | Phase 3 | KernelTransport | Must have before distributed state |
| Distributed Lock Manager (§31a) | Phase 3-4 | KernelTransport, Heartbeat | RDMA-native DLM; prerequisite for clustered FS |
| Distributed IPC (RDMA rings) | Phase 3-4 | KernelTransport, IPC | Natural extension of existing IPC |
| Pre-registered kernel memory | Phase 3-4 | RDMA driver, IOMMU | Performance prerequisite |
| PageLocation RemoteNode variant | Phase 4 | Memory manager | Small MM extension |
| DSM page fault handler | Phase 4 | PageLocation, KernelTransport | Core DSM functionality |
| DSM directory (home-node hash) | Phase 4 | DSM fault handler | Page ownership tracking |
| DSM coherence protocol | Phase 4-5 | DSM directory | Multiple-reader / single-writer |
| Distributed capabilities (signed) | Phase 4 | Capability system, Ed25519 | Security foundation |
| Cooperative page cache | Phase 4-5 | DSM, VFS | Distributed page cache |
| Global memory pool (basic) | Phase 5 | DSM, cgroups | Remote memory as swap tier |
| Cluster scheduler | Phase 5 | Global pool, DSM affinity | Process migration |
| Process migration | Phase 5 | Cluster scheduler | Freeze/thaw + lazy page fetch |
| Remote device proxy | Phase 5 | KernelTransport, AccelBase | Remote GPU/NVMe access |
| GPUDirect RDMA cross-node | Phase 5 | P2P DMA, RDMA | GPU↔GPU across network |
| Split-brain resolution | Phase 5 | Heartbeat, DSM | Quorum + fencing |
| CXL 2.0 pooled memory | Phase 5 | Memory manager | CXL memory as NUMA node |
| CXL 3.0 shared memory | Phase 5+ | CXL 2.0, DSM | Hardware-coherent DSM |
| Global memory pool (advanced) | Phase 5+ | All of above | Full cluster memory management |
| Cluster-wide cgroup integration | Phase 5+ | Cluster scheduler, global pool | Kubernetes-ready |
| DSM replication (fault tolerance) | Phase 5+ | DSM, replication protocol | For critical workloads |
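The "DSM directory (home-node hash)" row describes page-ownership tracking via hashing. A minimal sketch of the idea, with an illustrative hash (FNV-1a) and modulo placement that are assumptions rather than the ISLE protocol:

```rust
// Sketch of home-node selection for the DSM directory: the directory entry
// for each 4 KB page lives on a node chosen by hashing the page number.
// FNV-1a and modulo placement are illustrative assumptions.

const PAGE_SHIFT: u64 = 12; // 4 KB pages

fn fnv1a(mut x: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for _ in 0..8 {
        h ^= x & 0xff;
        h = h.wrapping_mul(0x100_0000_01b3);
        x >>= 8;
    }
    h
}

/// Node holding the directory entry for the page containing `vaddr`.
pub fn home_node(vaddr: u64, num_nodes: u64) -> u64 {
    fnv1a(vaddr >> PAGE_SHIFT) % num_nodes
}

fn main() {
    // All addresses within one 4 KB page resolve to the same home node.
    assert_eq!(home_node(0x1000, 8), home_node(0x1fff, 8));
    assert!(home_node(0x1000, 8) < 8);
}
```

Plain modulo placement reshuffles most entries when the node count changes; a production directory would likely use consistent hashing to bound rebalancing on membership change.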
Priority Rationale
Phase 3-4 (Foundation): RDMA transport + cluster formation + basic DSM. This makes ISLE cluster-aware and enables distributed IPC. MPI and NCCL workloads benefit immediately from kernel-native RDMA transport.
Phase 4-5 (Practical Wins): Cooperative page cache + global memory pool + signed capabilities. This is when distributed ISLE becomes genuinely useful: remote memory as a tier, shared file caching, and secure cross-node operations.
Phase 5+ (Competitive Advantage): Process migration, CXL integration, cluster-wide resource management. Features that no other OS provides. The kernel manages a cluster of machines as a single coherent system.
47.16 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| RDMA kernel transport | Original design (uses standard RDMA verbs) | None |
| DSM page coherence | Academic (published research: Ivy, TreadMarks, Munin, GAM) | None |
| Home-node directory | Academic (distributed hash table, published) | None |
| Global memory pool | Original design (extends NUMA model) | None |
| Cooperative page cache | Academic (published research) | None |
| Cluster scheduler | Original design (extends CBS) | None |
| Distributed capabilities | Original design (Ed25519 is public-domain) | None |
| CXL integration | CXL spec (public, royalty-free consortium) | None |
| Process migration | Academic (MOSIX concepts are published research) | None |
| Split-brain / quorum | Academic (Paxos, Raft; published) | None |
| CRDT revocation list | Academic (Shapiro et al., published) | None |
All components are either original design or based on published academic research and open specifications. No vendor-proprietary APIs or patented algorithms.
47.17 Comparison: Why Previous DSM Projects Failed and Why This Succeeds
| Factor | Kerrighed / OpenSSI / MOSIX | ISLE Distributed |
|---|---|---|
| Kernel design | Bolted onto Linux (30M+ LOC, assumes single machine) | Designed from scratch with distribution in mind |
| Coherence granularity | Cache-line (64B) — false sharing kills performance | Page-level (4KB) — matches network latency |
| Hardware support | None (pure software coherence) | RDMA (2020s), CXL 3.0 (2025+) |
| Memory model | Patched Linux MM (invasive, broke on updates) | PageLocationTracker already supports heterogeneous tiers |
| Transport | TCP sockets (high overhead) | RDMA one-sided ops (zero remote CPU) |
| Security | Unix permissions (not network-portable) | Cryptographically-signed capabilities |
| Fault tolerance | Fragile (node failure = cluster crash) | Quorum-based, bounded-lifetime capabilities, graceful degradation |
| Application compat | Modified syscall layer, broke things | Standard POSIX + opt-in extensions |
| Maintenance burden | Thousands of patches across all subsystems | Clean integration points (PageLocation, IPC transport, KABI proxy) |
| Timing | 2000s — hardware wasn't ready | 2026 — RDMA ubiquitous, CXL arriving, AI demands it |
48. SmartNIC and DPU Integration
48.1 Problem
DPUs (Data Processing Units) — NVIDIA BlueField, AMD Pensando, Intel IPU — are processors that sit on the network path. They have their own ARM cores, run their own OS, and can process network traffic, storage I/O, and security policies without using host CPU cycles.
The kernel needs a model for "this driver runs on the DPU, not on the host."
48.2 Design: Offload Tier
Extend the three-tier driver model (Section 6) with a concept of execution location:
// isle-core/src/driver/offload.rs
/// Where a driver's code executes.
/// `#[repr(C, u32)]` is used (rather than `#[repr(u32)]`) because the
/// `Offloaded` variant carries data fields: the `C, u32` form gives the
/// C-compatible tagged-struct layout, whereas plain `#[repr(u32)]` lays the
/// tag out inside each variant's union arm (RFC 2195), which is harder for
/// C consumers of the KABI.
#[repr(C, u32)]
pub enum DriverExecLocation {
/// Driver runs on host CPU (standard: Tier 0, 1, or 2).
Host = 0,
/// Driver runs on a DPU/SmartNIC.
/// Kernel communicates with it via PCIe mailbox or shared memory.
/// Same KABI vtable interface, different transport.
Offloaded {
/// Device ID of the DPU in the device registry.
dpu_device: DeviceNodeId,
/// Communication mechanism.
transport: OffloadTransport,
},
}
#[repr(u32)]
pub enum OffloadTransport {
/// PCIe mailbox (register-based, low-latency, small messages).
PcieMailbox = 0,
/// Shared memory region (mapped via PCIe BAR, bulk data).
SharedMemory = 1,
/// RDMA (if DPU has RDMA capability).
Rdma = 2,
}
48.3 How It Works
Standard driver (host):
Process → syscall → ISLE Core → KABI vtable call → Driver (host CPU)
Offloaded driver (DPU):
Process → syscall → ISLE Core → KABI vtable call →
→ OffloadProxy → PCIe mailbox → DPU driver → Hardware
The OffloadProxy implements the same KABI vtable as the real driver.
It translates vtable calls into DPU mailbox messages.
ISLE Core cannot tell the difference.
This is the same proxy pattern used for remote devices in the distributed kernel (Section 47.10). A DPU is essentially a very close "remote node" connected via PCIe instead of RDMA.
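The proxy pattern above can be sketched with a trait standing in for the KABI vtable. The trait, type names, and mailbox behavior here are illustrative assumptions; the point is only that ISLE Core calls through the same interface regardless of where the driver executes:

```rust
// Sketch of the OffloadProxy pattern: host driver and proxy implement the
// same "vtable" (a trait here), so the core cannot tell them apart.

trait NetDriverVtable {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32>;
}

struct HostNicDriver;                  // runs on the host CPU
struct OffloadProxy { pub sent: usize } // would forward over a PCIe mailbox

impl NetDriverVtable for HostNicDriver {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32> {
        Ok(frame.len()) // would program the NIC directly
    }
}

impl NetDriverVtable for OffloadProxy {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32> {
        // Would serialize the call into a mailbox message and ring the DPU;
        // here we just count forwarded calls.
        self.sent += 1;
        Ok(frame.len())
    }
}

// ISLE Core sees only the trait object, never the execution location.
fn core_send(drv: &mut dyn NetDriverVtable, frame: &[u8]) -> Result<usize, i32> {
    drv.send(frame)
}

fn main() {
    let mut host = HostNicDriver;
    let mut proxy = OffloadProxy { sent: 0 };
    assert_eq!(core_send(&mut host, &[0u8; 64]), Ok(64));
    assert_eq!(core_send(&mut proxy, &[0u8; 64]), Ok(64));
    assert_eq!(proxy.sent, 1);
}
```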
48.4 Use Cases
| Scenario | Host Path | DPU Path | Benefit |
|---|---|---|---|
| Network firewall | CPU processes every packet | DPU processes packets, host sees only allowed traffic | CPU freed for applications |
| NVMe-oF target | CPU handles RDMA + NVMe | DPU handles RDMA + NVMe, host CPU uninvolved | Zero host CPU for storage serving |
| IPsec / TLS | CPU encrypts/decrypts | DPU encrypts/decrypts | CPU freed, lower latency |
| vSwitch (OVS) | CPU handles VM networking | DPU handles VM networking | Major CPU savings in cloud |
| Telemetry | CPU collects and sends metrics | DPU collects and sends | No host CPU overhead |
48.5 DPU Discovery and Boot
DPU discovery is via standard PCIe enumeration (the DPU appears as a PCIe device). Firmware loading is vendor-specific: BlueField uses UEFI on the DPU's ARM cores, Intel IPU uses its own loader. The host kernel's role:
1. Enumerate the DPU as a PCIe device (standard PCI scan at boot).
2. Establish a communication channel (PCIe mailbox or shared memory BAR).
3. KABI vtable exchange: host and DPU exchange offload capabilities.
4. DPU firmware loading is out of scope; it is managed by the DPU's own boot chain.
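Step 3 of the discovery sequence (capability exchange) can be sketched as a negotiation over descriptors traded through the mailbox. The struct layout and capability bits below are assumptions, not the ISLE wire format:

```rust
// Illustrative KABI capability exchange: host and DPU each advertise a
// descriptor; offload proceeds only for capabilities both sides support
// and only when the KABI versions match.

#[repr(C)]
#[derive(Clone, Copy)]
pub struct OffloadCaps {
    pub kabi_version: u32,
    pub cap_bits: u64, // hypothetical: bit 0 = firewall, bit 1 = NVMe-oF, ...
}

/// Returns the agreed capability set, or None if the KABI versions differ
/// (in which case all functions stay on host-side drivers).
pub fn negotiate(host: OffloadCaps, dpu: OffloadCaps) -> Option<u64> {
    if host.kabi_version != dpu.kabi_version {
        return None;
    }
    Some(host.cap_bits & dpu.cap_bits)
}

fn main() {
    let host = OffloadCaps { kabi_version: 3, cap_bits: 0b0111 };
    let dpu  = OffloadCaps { kabi_version: 3, cap_bits: 0b0101 };
    assert_eq!(negotiate(host, dpu), Some(0b0101));
}
```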
48.6 DPU Failure Handling
DPU crash / reboot / PCIe link failure:
1. PCIe error detected (AER or link-down notification).
2. OffloadProxy marks all pending I/O as failed:
- Pending network I/O: -EIO to waiting processes.
- Pending storage I/O: -EIO, filesystem handles error.
- Pending control operations: -ENODEV.
3. Notify userspace via udev/netlink: DPU offline event.
4. Attempt DPU recovery:
a. PCIe hot-reset (if supported by hardware).
b. Wait for DPU firmware to re-initialize.
c. Re-establish mailbox communication.
d. Re-exchange KABI vtables.
5. If recovery succeeds: resume offloaded operations.
6. If recovery fails: fall back to host-side driver
(if one exists for the offloaded function).
Example: DPU was offloading NIC → fall back to host NIC driver.
Not all functions have host fallbacks — admin notified.
48.7 Shared State Consistency
DPU and host may share data-plane state (flow tables, packet counters) via PCIe shared memory:
Source of truth:
Data plane (flow entries, counters): DPU is authoritative.
DPU processes packets at line rate. Host reads are stale by ~μs.
Control plane (policy, configuration): Host is authoritative.
Host writes policy. DPU reads and applies.
Consistency model:
Shared memory uses producer-consumer pattern with sequence counters.
No locks across PCIe (latency too high for spinlocks).
Host writes are fenced (SFENCE) before signaling DPU via mailbox.
DPU writes are fenced before updating sequence counter.
Host reads check sequence counter for torn-write detection.
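The sequence-counter scheme above is essentially a seqlock. A single-file sketch, with a plain struct standing in for the PCIe BAR mapping and atomic orderings standing in for the explicit fences (names are illustrative):

```rust
// Seqlock-style producer-consumer sketch for DPU-owned counters read by the
// host over shared memory. The DPU is the sole writer; the host retries any
// read that straddles a write (torn-write detection via the sequence counter).

use std::sync::atomic::{AtomicU64, Ordering};

struct SharedCounters {
    seq: AtomicU64,     // even = stable, odd = write in progress
    packets: AtomicU64,
    bytes: AtomicU64,
}

impl SharedCounters {
    // DPU side: make seq odd, update payload, make seq even again.
    fn write(&self, packets: u64, bytes: u64) {
        self.seq.fetch_add(1, Ordering::Release); // now odd
        self.packets.store(packets, Ordering::Relaxed);
        self.bytes.store(bytes, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // now even
    }

    // Host side: retry until a consistent (untorn) snapshot is observed.
    fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 != 0 {
                continue; // write in progress
            }
            let p = self.packets.load(Ordering::Relaxed);
            let b = self.bytes.load(Ordering::Relaxed);
            if self.seq.load(Ordering::Acquire) == s1 {
                return (p, b);
            }
        }
    }
}

fn main() {
    let c = SharedCounters {
        seq: AtomicU64::new(0),
        packets: AtomicU64::new(0),
        bytes: AtomicU64::new(0),
    };
    c.write(100, 64_000);
    assert_eq!(c.read(), (100, 64_000));
}
```

Over a real PCIe BAR the Relaxed stores would be paired with explicit SFENCE/mailbox signaling as described above; the retry loop is what makes cross-PCIe reads safe without any shared lock.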
DPU Firmware Update Lifecycle:
Updating DPU firmware while services are running:
- Migrate all offloaded functions back to host-side drivers (if host fallbacks exist).
- Quiesce the DPU: drain in-flight I/O, flush shared-memory state.
- Apply firmware update via vendor-specific mechanism (BlueField:
bfb-install, Intel IPU: vendor tool). - DPU reboots with new firmware.
- Re-establish KABI vtable exchange (new firmware may have new capabilities).
- Re-offload functions that were migrated to host.
If no host fallback exists for an offloaded function, the update requires a maintenance window (the function is unavailable during DPU reboot).
DPU Multi-Tenancy:
In cloud environments, a single DPU serves multiple VMs/containers:
- SR-IOV VFs: The DPU's NIC exposes SR-IOV Virtual Functions, one per tenant. Each VF is passed through to a VM via IOMMU. Hardware isolates VF traffic.
- Hardware flow classification: The DPU's embedded switch classifies packets by flow (5-tuple or VXLAN VNI) and routes to the correct VF.
- Per-VF offload: Each tenant's offloaded functions (firewall rules, encryption keys, QoS policies) are isolated per VF. The host kernel's cgroup hierarchy maps to VF assignments.
Offload Decision Criteria:
The kernel decides whether to offload a function to the DPU based on:
- Capability: Does the DPU support this function? (Checked via KABI vtable exchange.)
- Admin policy: /sys/kernel/isle/dpu/<id>/offload_policy (auto, manual, disabled).
- Intent integration: If the cgroup's intent.efficiency is high, prefer DPU offload (reduces host CPU power). If intent.latency_ns is very low, check whether the PCIe mailbox round-trip (~1-2 μs) exceeds the latency budget.
- Load: If the DPU's embedded cores are saturated (utilization > 90%), do not offload additional functions.
48.8 Performance Impact
When using DPU offload: the host CPU does LESS work. Performance improves for host applications because infrastructure processing moves to the DPU.
Overhead: one PCIe mailbox round-trip (~1-2 μs) per KABI vtable call that crosses the host-DPU boundary. But DPU-offloaded drivers handle the fast path entirely on the DPU — the host only sees control-plane operations (setup, teardown, configuration changes).