Sections 47–48 of the ISLE Architecture. For the full table of contents, see README.md.
Part XII: Distributed Systems
Distributed kernel architecture and SmartNIC/DPU integration.
47. Distributed Kernel Architecture
This section extends ISLE from a single-machine kernel to a distributed-capable kernel with RDMA-native primitives, page-level distributed shared memory, cluster-aware scheduling, global memory pooling, and network-portable capabilities. These are kernel-internal capabilities — distributed applications remain in userspace, but the kernel transparently optimizes data placement and movement across RDMA fabric.
Design Constraints:
- Drop-in compatibility: A single-node ISLE system behaves identically to a non-distributed kernel. All distributed features are opt-in. Existing Linux binaries (MPI, NCCL, Spark, Redis, PostgreSQL) work without modification.
- Superset, not replacement: Standard TCP/IP sockets, POSIX shared memory, and SysV IPC work exactly as before. Distributed capabilities are additional. Applications that opt in (via new interfaces or transparent kernel policies) get better performance.
- RDMA-native from day one: The kernel's core primitives (IPC, page cache, memory management) are designed with RDMA transport in mind, not bolted on after the fact.
- Page-level coherence: Distributed memory coherence operates at page granularity (4KB minimum), not cache-line granularity (64B). This is the fundamental design decision that makes distributed shared memory practical over network latencies.
- Graceful degradation: Node failures are handled. No single point of failure. Split-brain is detected and resolved. Partial cluster operation is always possible.
47.1 Motivation
47.1.1 The Hardware Shift
The datacenter is becoming a single, disaggregated computer:
2015: Machines are islands. 10GbE, TCP/IP, millisecond latencies.
Networking = slow, unreliable. Kernel = local machine only.
2020: RDMA everywhere. 100GbE RoCEv2 / InfiniBand HDR (200Gb/s).
1-2 μs latency. Kernel-bypass networking is the norm for HPC.
2024: CXL 2.0 memory pooling. Disaggregated memory over PCIe fabric.
Memory can live outside the machine, accessible at ~200-400ns.
2025-2026: CXL 3.0 hardware-coherent shared memory. PCIe 6.0 (64 GT/s).
400GbE / InfiniBand NDR (400Gb/s). Sub-microsecond RDMA.
2027+: CXL switches, memory fabric topology, composable infrastructure.
The distinction between "local" and "remote" memory blurs.
The hardware is converging on a model where:
- Network latency (RDMA) ≈ remote NUMA latency ≈ CXL latency
- All three are ~1-5 μs, compared to NVMe SSD at ~10-15 μs
- Remote memory over RDMA is faster than local SSD
No existing operating system is designed for this reality.
47.1.2 Why Linux Cannot Adapt
Linux has two completely separate networking paradigms:
World A: Socket-based (kernel-managed)
- TCP/IP, UDP, Unix sockets
- Kernel manages connections, buffers, routing
- Page cache, VFS, block I/O all use this world
- Latency: ~5 μs per packet (kernel overhead)
World B: RDMA/verbs (kernel-bypass)
- InfiniBand verbs, RoCE
- Application manages everything via libibverbs
- Kernel provides setup (protection domains, memory registration) then gets out
- Page cache, VFS, block I/O know nothing about this world
- Latency: ~1-2 μs per operation
These worlds do not interact. There is no way for the Linux page cache
to fetch a page from a remote node via RDMA. There is no way for the
Linux scheduler to migrate a process to where its data lives across
an RDMA link. There is no way for Linux IPC to transparently extend
to a remote node.
Most distributed features in Linux (DRBD, Ceph, GlusterFS, GFS2, OCFS2)
are built on World A (sockets). NFS and SMB have added RDMA transports
(svcrdma/xprtrdma since 2.6.24, SMB Direct since v4.15, KSMBD since
v5.15), but these are bolt-on transport alternatives — the core protocols,
data structures, and failure handling remain socket-oriented. None use
RDMA verbs (CAS, FAA, one-sided Read/Write) for lock-free data structure
access or coherence protocols. ISLE's distinction is using RDMA atomics
and one-sided operations as the primary coordination primitive, not just
as a transport.
Previous attempts to add distributed capabilities to Linux (Kerrighed, OpenSSI, MOSIX, SSI clusters, GAM patches) all failed because:
- Linux's core subsystems (MM, scheduler, VFS, IPC) assume single-machine
- Patches touched thousands of lines across dozens of subsystems
- Every kernel update broke the patches
- Cache-line-level coherence over network was too expensive
- No clean abstraction boundary — distributed logic was smeared everywhere
47.1.3 ISLE's Structural Advantage
ISLE's existing architecture is uniquely suited for distributed extension:
| Existing Feature | Distributed Extension |
|---|---|
| NUMA-aware memory manager (per-node buddy allocators) | Remote node = distant NUMA node |
| PageLocationTracker (Section 43.1.5) | Already tracks CPU, GPU, compressed, swap — add RemoteNode |
| MPSC ring buffer IPC | Ring buffer maps naturally to RDMA queue pair |
| Capability tokens (generation-based revocation) | Cryptographic signing → network-portable |
| Device registry (topology tree) | Extend topology to include RDMA fabric |
| AccelBase KABI (Section 42) | GPU on remote node = remote accelerator |
| CBS bandwidth guarantees (Section 15) | Extend CBS to cluster-wide resource accounting |
| Object namespace (Section 41) | \Cluster\Node2\Devices\gpu0 |
The key insight: ISLE already models heterogeneous memory (CPU RAM, GPU VRAM, compressed pages, swap) as different tiers in a unified memory hierarchy. Remote memory over RDMA is just another tier. The memory manager already knows how to migrate pages between tiers on demand. Extending "tiers" to include remote nodes is a natural generalization, not a fundamental redesign.
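The "remote memory is just another tier" claim can be pictured as one extra variant on the page-location type. A hedged sketch: `PageLocation`, `NodeId`, and the cost figures below are illustrative stand-ins, not the actual Section 43.1.5 definitions; the numbers mirror the distance-matrix examples in Section 47.2.4.

```rust
/// Illustrative page-location tiers. The real PageLocationTracker
/// (Section 43.1.5) tracks CPU, GPU, compressed, and swap; RemoteNode
/// is the one new variant distributed operation adds.
type NodeId = u32;

#[derive(Clone, Copy, PartialEq, Debug)]
enum PageLocation {
    CpuDram { numa_node: u32 },
    GpuVram { device: u32 },
    Compressed,
    Swap,
    /// New: page lives in another node's DRAM, reachable via RDMA.
    RemoteNode { node: NodeId },
}

/// Rough per-4KB migration cost in nanoseconds (illustrative numbers
/// taken from the Section 47.2.4 examples).
fn migration_cost_ns(loc: PageLocation) -> u32 {
    match loc {
        PageLocation::CpuDram { .. } => 500,      // cross-socket memcpy
        PageLocation::GpuVram { .. } => 1_500,    // PCIe DMA
        PageLocation::Compressed => 2_000,        // decompress in place
        PageLocation::RemoteNode { .. } => 4_000, // RDMA RTT + DMA, same switch
        PageLocation::Swap => 12_000,             // local NVMe read
    }
}
```

Note that `RemoteNode` (~4 μs) sorts below `Swap` (~12 μs): the memory manager's existing cost-driven tier migration automatically prefers remote DRAM over local SSD, which is exactly the observation in Section 47.1.1.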
47.2 Cluster Topology Model
47.2.1 Extending the Device Registry
The device registry (Section 7) models hardware topology as a tree with parent-child and provider-client edges. For distributed operation, extend the tree to span multiple nodes:
\Cluster (root of distributed namespace)
+-- node0 (this machine)
| +-- pci0000:00
| | +-- 0000:41:00.0 (GPU)
| | +-- 0000:06:00.0 (NVMe)
| | +-- 0000:03:00.0 (RDMA NIC, mlx5)
| +-- cpu0 ... cpu31
| +-- numa-node0 (512GB DDR5)
| +-- numa-node1 (512GB DDR5)
|
+-- node1 (remote machine, discovered via RDMA fabric)
| +-- [remote device tree, cached]
| +-- numa-node0 (512GB DDR5, reachable via RDMA)
| +-- gpu0 (80GB VRAM, reachable via GPUDirect RDMA)
|
+-- node2 ...
|
+-- fabric (RDMA fabric topology)
+-- switch0 (InfiniBand switch)
| +-- port0 → node0:mlx5_0
| +-- port1 → node1:mlx5_0
+-- switch1
+-- port0 → node2:mlx5_0
+-- port1 → node3:mlx5_0
47.2.2 Cluster Node Descriptor
/// Describes a node in the cluster (including self).
#[repr(C)]
pub struct ClusterNode {
/// Unique node ID (assigned during cluster join, never reused).
pub node_id: NodeId,
/// Node state.
pub state: NodeState,
/// RDMA endpoint for reaching this node.
pub rdma_endpoint: RdmaEndpoint,
/// Total CPU memory available for remote access (bytes).
pub remote_accessible_memory: u64,
/// Number of NUMA nodes.
pub numa_nodes: u32,
/// Number of accelerators (GPUs, NPUs, etc.).
pub accelerator_count: u32,
/// Round-trip latency to this node (nanoseconds, measured).
/// u32 covers up to ~4.29 seconds, sufficient for datacenter and
/// campus-scale clusters. WAN links with higher RTT use the
/// TCP fallback transport (Section 47.14.6), not RDMA, and are
/// represented in the distance matrix (Section 47.2.4) instead.
pub measured_rtt_ns: u32,
/// Unidirectional bandwidth to this node (bytes/sec, measured).
/// Field name uses `bytes_per_sec` to avoid ambiguity with
/// "bps" (bits per second) common in networking contexts.
pub measured_bw_bytes_per_sec: u64,
/// Heartbeat: last received timestamp.
pub last_heartbeat_ns: u64,
/// Heartbeat: monotonic generation (detects restarts).
pub heartbeat_generation: u64,
/// Cluster protocol version (must match to join).
/// Nodes with mismatched protocol_version are rejected during cluster join.
pub protocol_version: u32,
pub _pad: [u8; 32],
}
pub type NodeId = u32;
#[repr(u32)]
pub enum NodeState {
/// Node is reachable and healthy.
Active = 0,
/// Node is joining (exchanging topology, syncing state).
Joining = 1,
/// Node missed heartbeats but not yet declared dead.
Suspect = 2,
/// Node is unreachable. Its resources are being reclaimed.
Dead = 3,
/// Node is gracefully leaving (draining work, migrating pages).
Leaving = 4,
}
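Failure detection drives these states from heartbeat age. A minimal sketch of the timeout-driven transitions, assuming the 1-second suspect timeout stated in the Mode A summary (§47.2.2) and a hypothetical 5-second dead timeout that this section does not fix:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState { Active, Joining, Suspect, Dead, Leaving }

/// Heartbeats arrive every 100 ms. Suspect after 1 s of silence (per the
/// Mode A summary); Dead after 5 s (assumed here for illustration).
const SUSPECT_AFTER_NS: u64 = 1_000_000_000;
const DEAD_AFTER_NS: u64 = 5_000_000_000;

/// Advance a node's state from the age of its last heartbeat.
/// Joining and Leaving are driven by the membership protocol, not by
/// timeouts, and Dead is terminal (a restarted node rejoins fresh).
fn next_state(current: NodeState, now_ns: u64, last_heartbeat_ns: u64) -> NodeState {
    let silence = now_ns.saturating_sub(last_heartbeat_ns);
    match current {
        NodeState::Active | NodeState::Suspect => {
            if silence >= DEAD_AFTER_NS {
                NodeState::Dead
            } else if silence >= SUSPECT_AFTER_NS {
                NodeState::Suspect
            } else {
                NodeState::Active // a fresh heartbeat clears Suspect
            }
        }
        other => other,
    }
}
```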
#[repr(C)]
pub struct RdmaEndpoint {
/// RDMA GID (Global Identifier) — InfiniBand/RoCE address.
pub gid: [u8; 16],
/// Queue pair number for control channel.
pub control_qpn: u32,
/// Protection domain key for this cluster.
pub pd_key: u32,
/// RDMA device index on the local machine.
pub local_rdma_device: u32,
pub _pad: [u8; 12],
}
Device-local kernels as cluster members — The ClusterNode structure describes
traditional CPU-based compute nodes, but modern hardware increasingly runs its own
firmware OS:
- SmartNICs/DPUs: NVIDIA BlueField-2/3 DPUs run full Ubuntu with 8-16 ARM cores, 16-32 GB DRAM, and can host containers and VMs. Intel IPU and AMD Pensando DPUs run similar firmware stacks.
- GPUs: NVIDIA GPUs run CUDA firmware that schedules work across thousands of cores, manages HBM memory, and coordinates P2P transfers. AMD GPUs run ROCm firmware.
- Storage controllers: High-end NVMe controllers and RAID cards run embedded RTOS or Linux to manage flash translation layers, wear leveling, and caching.
- CXL devices: CXL defines three device types, each with a different operating model in ISLE's multikernel cluster:
  - Type 1 (coherent compute, no device-managed memory): Device compute participates in the host CPU cache coherency domain via CXL.cache. Natural Mode B peer — ring buffers in shared memory are hardware-coherent without explicit flush. Examples: coherent FPGAs, smart NICs with CXL.
  - Type 2 (compute + device-managed memory): Both CXL.cache (device cache in host coherency domain) and CXL.mem (device DRAM accessible to host via load/store). The richest ISLE peer type — bidirectional zero-copy coherent access. Device runs ISLE on its embedded cores, Mode B ring buffers are coherent in both directions. Examples: future CXL-attached GPUs, AI accelerators with HBM.
  - Type 3 (memory expansion, minimal or no compute): Provides additional DRAM via CXL.mem; host sees it as a slower NUMA node. The tiny management processor (if present, typically ARM/RISC-V) acts as a memory-manager peer, not a compute peer: it manages tiering, compression, encryption, and error reporting for the pool, but does not run workloads. Examples: Samsung CMM-H, Micron CZ120, SK Hynix AiMM. See §47.12.5 for the full Type 3 operating model and crash recovery distinction.
Rather than treating these as passive devices controlled exclusively by the host kernel, ISLE's distributed design allows device-local kernels to participate as first-class cluster members. A BlueField-3 DPU running ISLE could:
- Join the cluster as a peer node with its own NodeId, exchange topology with other nodes, and participate in membership/heartbeat protocols.
- Expose resources in the device registry: its own CPUs, DRAM, and attached storage/network as remotely-accessible resources.
- Run workloads: containers or VMs can be scheduled on the DPU's cores, with distributed locking and DSM providing transparent access to host memory or other cluster nodes.
- Offload functions: RDMA transport, network filtering, encryption, compression, or storage can run on the DPU with kernel-level coordination via the distributed lock manager and DSM.
This multikernel model treats a single physical server as a cluster of heterogeneous kernels — one on the host CPU, one on each DPU, one on each GPU (if the GPU firmware exposes cluster primitives). The distributed protocols (membership, DSM, DLM, quorum) work identically whether communicating between physical servers or between the host and a DPU on the same PCIe bus.
Protocol requirements — For a device-local kernel to participate as a first-class cluster member, it must implement ISLE's inter-kernel messaging protocol. This is a wire protocol, not an API — the device kernel does not need to be ISLE itself, but it must speak the same language:
- Transport layer: RDMA (for network-attached nodes) or PCIe P2P MMIO+interrupts (for on-board devices like DPUs/GPUs). The device must expose:
  - A control channel for cluster management messages (join, heartbeat, topology sync)
  - A data channel for DSM page transfers and DLM lock requests
  - MMIO-mapped doorbell registers or MSI-X interrupts for signaling
- Cluster membership protocol (Section 47.3):
  - Implement the join handshake: authenticate, exchange topology, sync protocol version
  - Send periodic heartbeats (every 100ms) with monotonic generation counter
  - Respond to membership queries with node state (Active, Suspect, Dead, Leaving)
  - Participate in failure detection: mark other nodes as Suspect if heartbeats missed
- DSM page protocol (Section 47.5):
  - Accept page ownership transfer requests: PAGE_REQUEST(vpfn, read|write)
  - Respond with page data or forward the request if not owner
  - Implement cache coherence state machine (Owner, Shared, Invalid)
  - Participate in invalidation broadcasts for write requests
- DLM lock protocol (Section 47.6):
  - Accept lock acquisition requests: LOCK_ACQUIRE(lock_id, mode=shared|exclusive)
  - Maintain lock ownership table and grant/deny based on current holders
  - Support one-sided RDMA lock operations (atomic CAS on lock words)
  - Implement deadlock detection timeout (10 seconds default)
- Serialization format: All messages use fixed-size binary structs with explicit padding and versioning. Each message has a 32-byte header:

#[repr(C)]
pub struct ClusterMessageHeader {
    pub protocol_version: u32, // Currently 1
    pub message_type: u32,     // JOIN_REQUEST, HEARTBEAT, PAGE_REQUEST, etc.
    pub node_id: NodeId,       // Sender's node ID
    pub _pad0: u32,            // Explicit padding so `sequence` is 8-byte aligned
    pub sequence: u64,         // Message sequence number (for ordering)
    pub payload_length: u32,   // Bytes following this header
    pub checksum: u32,         // CRC32C of header (checksum field zeroed) + payload
}
// Total size: exactly 32 bytes under repr(C), with no implicit padding.

Payload structs are defined in Section 47.4 (message formats). All fields are little-endian. Nodes with mismatched protocol_version are rejected during join.
Implementation paths for device vendors:
- Path A: Full ISLE on device — Run ISLE kernel on the device's embedded CPU (e.g., BlueField DPU with 16 ARM cores runs ISLE natively). This gives full protocol support with zero extra work. The device becomes a cluster node indistinguishable from a regular server.
- Path B: Firmware shim — Device vendor implements a minimal protocol adapter in their existing firmware. The adapter translates ISLE cluster messages into the device's native operations. Example: NVIDIA GPU firmware receives PAGE_REQUEST messages and responds by copying HBM pages to system memory via GPUDirect RDMA. Does not require rewriting the entire firmware stack.
- Path C: Host-side proxy driver — A Tier 1 driver on the host acts as proxy, translating ISLE cluster messages into device-specific commands (PCIe transactions, proprietary protocols). The device appears as a cluster member but is actually controlled by the proxy. Lower performance than Path B (extra CPU involvement) but requires zero firmware changes.
Near-term hardware targets — ISLE already builds for aarch64-unknown-none and
riscv64gc-unknown-none-elf. Devices with ARM or RISC-V cores can run the ISLE kernel
with zero ISA porting work, making Path A immediately actionable:
| Device | Cores | ISA | Path | Notes |
|---|---|---|---|---|
| NVIDIA BlueField-2 DPU | 8× Cortex-A72 | AArch64 | A | Replace host OS with ISLE. PCIe P2P to host. Currently runs Ubuntu. |
| NVIDIA BlueField-3 DPU | 16× Neoverse N2 | AArch64 | A | Same. Higher bandwidth NIC. |
| Marvell OCTEON 10 DPU | ARM Neoverse N2 | AArch64 | A | Open SDK. Same category as BlueField. |
| Microchip PolarFire SoC FPGA | 4× U54 | riscv64gc | A | ISLE boot target. FPGA implements custom datapath. Open toolchain. |
| StarFive JH7110 (VisionFive 2) | 4× U74 | riscv64gc | A | Boots ISLE today. PCIe expansion for host interconnect. |
| SiFive Intelligence X280 | U74 + RVV | riscv64gc | A | RISC-V vector AI accelerator. ISLE-compatible ISA. |
| Netronome Agilio CX (NFP3800) | NFP microengines | proprietary | B | Open C/BPF SDK. Published ring interface specs. Implement ISLE ring protocol in NFP firmware. |
| AMD/Xilinx Alveo U50/U250 | FPGA + ARM | AArch64 / FPGA | A or B | Fully programmable. Define any protocol in RTL. ISLE on embedded ARM for Path A. |
| Samsung SmartSSD | Zynq UltraScale+ (ARM + FPGA) | AArch64 | A or B | ARM Cortex-A53 runs ISLE. FPGA handles NVMe datapath. NVMe CSI spec published. |
| Samsung CMM-H | ARM management core | AArch64 | A (mgmt) | CXL Type 3 memory expander. Management core runs ISLE as a memory-manager peer (Type 3 model, §47.12.5). 256 GB–1 TB LPDDR5 pool. |
| Micron CZ120 | ARM management core | AArch64 | A (mgmt) | CXL Type 3. Same model as CMM-H. CXL 2.0, 128 GB–512 GB. |
| SK Hynix AiMM | ARM management core | AArch64 | A (mgmt) | CXL Type 3 with in-memory compute (PIM). Management core as memory-manager peer. |
The key insight: RISC-V devices are uniquely positioned as zero-effort Path A targets.
ISLE already cross-compiles to riscv64gc-unknown-none-elf with OpenSBI boot. Any device
with a RISC-V core and OpenSBI can boot an unmodified ISLE kernel — no porting required.
ARM-based DPUs (BlueField, OCTEON) are equally zero-effort via the AArch64 build target.
Security boundary — If a device firmware participates as a cluster member, it must be trusted to the same degree as any cluster node. A malicious or compromised GPU firmware with cluster membership could:
- Request arbitrary memory pages via DSM (reading sensitive data)
- Corrupt shared memory by writing to DSM pages
- Initiate denial-of-service by flooding lock requests
Therefore, device cluster membership is disabled by default and enabled per-device:
echo 1 > /sys/bus/pci/devices/0000:41:00.0/isle_cluster_enabled
Only devices running signed, verified firmware (Section 6) should be granted cluster membership in secure environments.
Initially, only host kernels participate. Device participation is a Phase 5-6 capability (Sections 55-56) that requires firmware modifications by hardware vendors or open-source firmware projects. The protocol specification will be published as an RFC-style document to enable third-party implementations.
Real-world precedents:
- Barrelfish multikernel OS: Research OS where each CPU core runs its own kernel instance, coordinating via message passing. ISLE generalizes this to heterogeneous hardware.
- BlueField DPU offload: Current NVIDIA BlueField firmware can run OVS, storage targets, or custom applications, but coordination with the host is ad-hoc userspace protocols. ISLE provides kernel-level coordination.
- GPU Direct Storage: NVIDIA GDS allows GPUs to directly access NVMe storage, bypassing the CPU. This is a point solution. ISLE's model makes such bypasses general-purpose.
Lightweight mode for intra-machine devices — The full distributed protocol (DSM page transfers, DLM with RDMA CAS, quorum protocols) was designed for multi-node clusters over RDMA networks with 1-5 μs latency. For intra-machine devices (host ↔ DPU/GPU on the same PCIe bus), the latency is 10x lower (~200-500ns), but failure modes are different (no network partitions, but devices can hang or reset independently).
Architectural principle: message passing is the primitive. ISLE's IPC model (Section 8a) is built on message passing — explicit ownership transfer, capability-mediated channels, and defined send/receive semantics. This is the architectural primitive. It composes uniformly across every boundary ISLE targets: intra-kernel, kernel-user, cross-process, cross-VM, cross-network, and hardware peer. The "shared memory fast path" described in Mode B below is not a competing model — it is how message passing is implemented locally when hardware cache coherency is available. The abstraction is always message passing; the transport is chosen to match the hardware.
Two coordination modes are supported:
Mode A: Full Distributed Protocol (default for network-attached nodes)
- All messages via RDMA or PCIe P2P with ClusterMessageHeader wire format
- DSM: explicit page ownership transfer with RDMA Read/Write
- DLM: distributed lock tables, RDMA atomic CAS, deadlock detection
- Membership: heartbeat every 100ms, suspect timeout 1 second
- Best for: Multi-node clusters, devices that may have partial failures
Mode B: Hardware-Coherent Transport (optional for trusted local devices)
- The message-passing ownership guarantee is provided by the hardware cache coherency
protocol (PCIe ACS, CCIX, or CXL.cache) rather than by the software ownership
transfer protocol. The MESI state machine in hardware plays the same role that the
software ownership protocol plays over RDMA: only one writer at a time, cached reads
see the latest write. No software ownership messages are needed on top of hardware
coherency — that would be redundant.
- Both host and device map the same physical memory (via PCIe BAR mappings or pinned
system memory). Locks use local atomic ops (x86 LOCK CMPXCHG, ARM LDXR/STXR)
on cache-coherent shared memory instead of RDMA CAS.
- Membership via MMIO doorbell registers (no network heartbeat overhead).
- 5-10x lower latency: ~50-100ns for lock acquire vs ~500ns-1μs for RDMA CAS.
- Requires: device must be cache-coherent with host (PCIe ATS + ACS, CCIX, or CXL.cache).
Non-coherent devices (all current discrete GPUs, most current DPUs) must use Mode A.
When to use each mode:
| Scenario | Mode | Reason |
|---|---|---|
| Multi-node RDMA cluster | A | Must handle network failures, can't assume cache coherence |
| BlueField DPU running full ISLE | A | Separate memory space, needs explicit coordination |
| Future CXL 3.0-coherent GPU | B | CXL.cache coherency makes hardware the ownership protocol |
| Integrated GPU / APU (UMA) | B | CPU and GPU share coherent on-die/on-package memory |
| NVMe storage controller | B | Controller and host share command queues in coherent memory |
| Untrusted/unverified device | A | Mode B relies on hardware coherency guarantee — only for verified hardware |
Mode B is an optimization, not a separate protocol. It reuses the same data structures (lock tables, membership records) but the coherency guarantee is provided by hardware instead of software message passing. A device can fall back to Mode A if cache coherence fails or if the device resets.
Selection is per-device at join time:
# Force Mode A (full protocol) for untrusted DPU
echo "rdma://pci:0000:41:00.0?mode=distributed" > /sys/kernel/isle/cluster/join
# Enable Mode B (shared memory) for trusted GPU with cache-coherent access
echo "shmem://pci:0000:03:00.0" > /sys/kernel/isle/cluster/join
Implementation note: Mode B requires PCIe ATS (Address Translation Services) or CXL.cache to ensure device accesses to system memory are cache-coherent. Non-coherent devices (most current GPUs) must use Mode A.
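In Mode B the lock word is an ordinary atomic in cache-coherent shared memory, so acquire is a local CAS rather than an RDMA CAS. A sketch under that assumption; the lock-word encoding (0 = free, otherwise the holder's NodeId, so node IDs are assumed to start at 1) is illustrative, not the DLM's actual format:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

type NodeId = u32;
const LOCK_FREE: u32 = 0; // sketch assumes valid NodeIds start at 1

/// Illustrative Mode B lock word: lives in memory mapped by both host and
/// device. Hardware coherency (CXL.cache or coherent PCIe) makes the CAS
/// globally visible, playing the role RDMA CAS plays in Mode A.
fn try_lock(word: &AtomicU32, me: NodeId) -> bool {
    word.compare_exchange(LOCK_FREE, me, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn unlock(word: &AtomicU32, me: NodeId) {
    // Only the holder may release. A real DLM would guard this with
    // ownership checks and fall back to Mode A if the device resets.
    let prev = word.swap(LOCK_FREE, Ordering::Release);
    debug_assert_eq!(prev, me);
}
```

On x86 the `compare_exchange` compiles to LOCK CMPXCHG and on ARM to an LDXR/STXR pair, matching the ~50-100ns acquire latency cited above for uncontended locks.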
47.2.2.1 Host-Side Component: isle-peer-transport
A device participating as a multikernel peer does not require a traditional
device driver on the host. A traditional driver (e.g., mlx5_core, amdgpu,
nvme) must understand the device's register layout, command format, initialization
sequence, error recovery procedure, and internal resource model. All of that
complexity now lives inside the device's own kernel. The host never touches device
registers and has no knowledge of the device's internals.
What the host requires instead is a single generic module — isle-peer-transport
— that handles the PCIe connection to any ISLE peer device, regardless of what the
device actually does:
/// Host-side state for one ISLE peer kernel, maintained by isle-peer-transport.
/// Identical structure for every peer device — NIC, GPU, storage, custom ASIC.
/// The device's function is irrelevant to this layer.
pub struct PeerTransport {
/// PCIe BDF of the peer device (for unilateral controls, see §47.2.5).
pub pcie_bdf: PcieBdf,
/// Shared memory region for the inbound/outbound domain ring buffer pair.
/// Allocated by the host, mapped into PCIe BAR by device at join time.
pub ring_region: DmaBuffer,
/// MMIO doorbell register: host writes here to signal the device.
pub doorbell_mmio: MmioRegion,
/// MMIO watchdog register: device writes a counter here every ~10ms.
pub watchdog_mmio: MmioRegion,
/// Cluster membership state (shared with the membership protocol §47.3).
pub health: PeerKernelHealth,
/// Negotiated cluster protocol version.
pub protocol_version: u32,
}
isle-peer-transport does five things and nothing else:
- Enumerate — detect that the PCIe device at a given BDF exposes the ISLE peer capability register (a new PCI capability ID assigned to the ISLE protocol).
- Connect — allocate the shared ring buffer region, map the device's MMIO doorbell, run the cluster join handshake (§47.2.3).
- Monitor — poll the MMIO watchdog counter and participate in the heartbeat protocol to detect device failure (§47.2.5.4).
- Contain — execute IOMMU lockout, bus master disable, and FLR if the device fails (§47.2.5.5).
- Disconnect — handle voluntary CLUSTER_LEAVE (planned update, shutdown).
isle-peer-transport has zero device-specific logic. The same binary handles
a BlueField DPU, a RISC-V AI accelerator, a computational storage device, and any
future device that implements the ISLE peer protocol. The host kernel's dependency
on device-specific code goes from hundreds of thousands of lines (per device class)
to zero.
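The Monitor step reduces to checking that the device's MMIO watchdog counter keeps advancing. A hedged sketch: the 100 ms stall threshold and the `WatchdogState` bookkeeping are assumptions for illustration; the real check would read `watchdog_mmio` from the `PeerTransport` above.

```rust
/// Last observed watchdog reading and when it last changed.
struct WatchdogState {
    last_value: u64,
    last_change_ns: u64,
}

/// Illustrative stall threshold: the device bumps the counter every ~10 ms,
/// so 100 ms without movement suggests a hung or reset device (assumed value).
const STALL_AFTER_NS: u64 = 100_000_000;

/// Returns true if the peer looks alive. `current` would come from an
/// MMIO read of the watchdog register in the real transport.
fn watchdog_alive(st: &mut WatchdogState, current: u64, now_ns: u64) -> bool {
    if current != st.last_value {
        // Counter moved: record the new value and when we saw it change.
        st.last_value = current;
        st.last_change_ns = now_ns;
        return true;
    }
    now_ns.saturating_sub(st.last_change_ns) < STALL_AFTER_NS
}
```

A `false` return would trigger the Contain step (IOMMU lockout, bus master disable, FLR per §47.2.5.5) rather than any device-specific recovery, since the transport knows nothing about the device's internals.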
47.2.2.2 Live Firmware Update Without Host Reboot
Because the host has no device-specific driver and no knowledge of device internals, firmware updates on the device are entirely the device's own responsibility. The host is not involved in the update itself — it only observes the device leaving and rejoining the cluster.
Update procedure from the device's perspective:
1. Device decides to update (admin command, automatic policy, or
device-side health check triggers it).
2. Device sends CLUSTER_LEAVE (orderly departure, not a crash).
ClusterMessageHeader { message_type: MEMBER_LEAVING, node_id: self }
3. Host receives CLUSTER_LEAVE, executes graceful shutdown:
- Migrates any workloads running on device to surviving nodes.
- Drains in-flight IPC channels and DSM operations (waits for
completions, does not issue new requests to the device).
- Revokes cluster membership cleanly (no IOMMU lockout, no FLR —
this is voluntary, not a crash; §47.2.5 crash path not taken).
Host is fully operational throughout. Zero disruption to workloads
not using this device.
4. Device updates its own firmware:
- For Path A (full ISLE kernel): device does a kernel rolling update
(§52) or reboots its own cores with new kernel image.
- For Path B (firmware shim): device applies vendor firmware update
procedure — entirely internal, host has no visibility.
- For all-in-one firmware: same, internal to device.
The host cannot observe what happens inside. It only knows the
device is absent from the cluster.
5. Device reinitializes hardware on new firmware (internal).
6. Device sends CLUSTER_JOIN with new protocol version and capabilities.
Host authenticates (verifies firmware signature per §22/47.2.3),
negotiates protocol version, exchanges updated topology.
Device rejoins — may announce new capabilities (e.g., firmware
added support for a new DSM extension).
7. Workloads migrate back to device (or new workloads scheduled).
Update cadence is fully independent per device:
| Component | Update authority | Host reboot required? |
|---|---|---|
| Host kernel | Host admin | No (§52 live evolution) or yes |
| Device firmware (any path) | Device / device admin | Never |
| ISLE cluster protocol | Negotiated at join | No (backward-compatible range) |
| Device hardware capabilities | Announced at rejoin | No |
A device can update its firmware multiple times per day. The host never reboots.
The host's isle-peer-transport module never changes. The only host-visible event
is the device being absent for the duration of the update (seconds to minutes,
device-dependent).
For Path A devices running full ISLE, §52 (Live Kernel Evolution) makes it possible to update individual kernel subsystems without even leaving the cluster — no CLUSTER_LEAVE at all. This is a future optimization requiring §52 to be implemented on the device kernel, but it is architecturally reachable.
47.2.2.3 Attack Surface Reduction
The shift from device-specific drivers to a generic peer transport has a significant security consequence. Traditional device drivers run in Ring 0 with full kernel privileges. A single memory-safety bug anywhere in driver code equals full kernel compromise. Driver code is the dominant source of kernel CVEs:
Linux kernel CVE distribution (approximate, 2020-2024):
~50% — driver bugs (memory safety, race conditions, use-after-free)
~15% — networking stack
~10% — filesystem
~25% — other subsystems
Lines of driver code per device class (approximate):
mlx5 (Mellanox NIC): ~150,000 lines in Ring 0
amdgpu (AMD GPU): ~700,000 lines in Ring 0
i915 (Intel GPU): ~400,000 lines in Ring 0
nvme (NVMe storage): ~15,000 lines in Ring 0
In the ISLE multikernel model:
isle-peer-transport (all devices, combined): ~2,000 lines in Ring 0
Device-specific code on device: lives behind IOMMU boundary
cannot reach host kernel memory
even if completely compromised
A vulnerability in device firmware (whether that is a full ISLE kernel, a firmware shim, or an all-in-one firmware) cannot escalate to the host kernel. The IOMMU is the hard boundary (§47.2.5.2). The firmware can be replaced or compromised entirely; the host kernel's critical structures (text, stacks, capability tables, scheduler state) remain unreachable.
The host's trust relationship with a peer device is:
1. At join time: verify firmware signature (§22) — the device presents a signed identity. If the signature is invalid, the join is rejected.
2. During operation: treat all messages as untrusted input — validate message type, version, checksum, and semantic correctness before acting. Same discipline as any network protocol.
3. On failure: IOMMU lockout bounds the damage regardless of what the device firmware does (§47.2.5).
This is a fundamentally different trust model than Ring 0 driver code, which must be trusted completely because there is no boundary between it and the kernel.
47.2.3 Topology Discovery
Cluster formation is explicit (no auto-discovery magic):
1. Admin configures cluster membership:
echo "rdma://10.0.0.2/mlx5_0" > /sys/kernel/isle/cluster/join
2. Kernel establishes RDMA control channel to target node:
- Create RDMA queue pair (reliable connected)
- Perform authenticated key exchange:
- Each node has an Ed25519 signing key pair (for authentication) and an X25519
Diffie-Hellman key pair (for key exchange)
- Nodes exchange X25519 public keys and authenticate them with Ed25519 signatures
- Shared secret derived via X25519 DH, then HKDF-SHA256 to derive session keys
- Mutual authentication via pre-shared cluster secret or PKI
3. Exchange topology information:
- Each node sends its device registry summary (devices, NUMA, accelerators)
- Each node sends its memory availability (total, available for remote access)
- Each node sends its RDMA capabilities (bandwidth, latency, features)
4. Measure link quality:
- Ping-pong latency measurement (RTT)
- Bandwidth probe (bulk RDMA write)
- Results stored in ClusterNode.measured_rtt_ns / measured_bw_bytes_per_sec
5. Fabric topology construction:
- If InfiniBand: query subnet manager for switch topology
- If RoCEv2: infer topology from latency measurements + LLDP
- Build cluster-wide distance matrix (for scheduling and placement)
6. Cluster is operational.
Heartbeat monitoring begins (RDMA send every 100ms per node).
47.2.4 Distance Matrix
The memory manager and scheduler need a cost model for data movement:
/// Cluster-wide distance matrix.
/// distance[src][dst] = cost in nanoseconds to move one page (4KB).
///
/// Examples (measured, not configured):
/// Local NUMA same-node: ~300ns (cache-hot) to ~3-5μs (cold, typical for migration)
/// Local NUMA cross-socket: ~500-800ns (4KB memcpy, QPI/UPI hop)
/// GPU VRAM (PCIe): ~1000-2000ns (4KB DMA over PCIe)
/// Remote node (same switch): ~3000-5000ns (RDMA RTT + DMA)
/// Remote node (cross-switch): ~5000-8000ns
/// CXL-attached memory: ~200-400ns
/// NVMe SSD: ~10000-15000ns
/// Remote NVMe (NVMe-oF/RDMA): ~15000-25000ns
/// HDD: ~5000000ns (5ms)
///
/// Key observation: remote RDMA is faster than local SSD.
/// Design constant: maximum cluster size. 64 nodes fits in a u64 bitmask
/// for efficient set operations (reader sets, membership views, allowed_nodes).
/// For clusters larger than 64 nodes, extend to `[u64; N]` or a hierarchical
/// directory (see Section 47.14.6 open questions).
pub const MAX_CLUSTER_NODES: usize = 64;
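The 64-node bound exists so that node sets reduce to single-word bit operations. An illustrative sketch — the `NodeSet` type and its methods are assumed names for this example, not part of the spec:

```rust
/// Sketch: a cluster node set (reader set, membership view, allowed_nodes)
/// as a u64 bitmask, one bit per node. All names here are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct NodeSet(pub u64);

impl NodeSet {
    pub const EMPTY: NodeSet = NodeSet(0);

    /// Add a node (e.g. a new reader of a DSM page).
    pub fn insert(self, node: u32) -> NodeSet {
        debug_assert!((node as usize) < 64);
        NodeSet(self.0 | (1u64 << node))
    }

    /// Remove a node (e.g. on membership revocation).
    pub fn remove(self, node: u32) -> NodeSet {
        NodeSet(self.0 & !(1u64 << node))
    }

    pub fn contains(self, node: u32) -> bool {
        self.0 & (1u64 << node) != 0
    }

    /// Member count — a single popcount instruction.
    pub fn len(self) -> u32 {
        self.0.count_ones()
    }

    /// Intersection, e.g. "readers that are still cluster members".
    pub fn intersect(self, other: NodeSet) -> NodeSet {
        NodeSet(self.0 & other.0)
    }
}
```

Insertion, removal, membership test, and intersection are all one or two ALU instructions, which is why the directory and membership paths can afford them on every page fault.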
/// Heap-allocated via slab — at ~49 KB this struct exceeds the kernel stack
/// (8-16 KB) and MUST NOT be stack-allocated. It is allocated from the slab
/// allocator at cluster join time (one instance per cluster membership) and
/// persists for the node's cluster lifetime. The `Box<ClusterDistanceMatrix>`
/// is stored in the `ClusterState` struct.
pub struct ClusterDistanceMatrix {
/// Node count (including self). Must be <= MAX_CLUSTER_NODES.
node_count: u32,
/// Distances[src_node * node_count + dst_node] = page migration cost (ns).
/// Symmetric for RDMA (full-duplex), but stored both directions
/// because asymmetric links exist (satellite, WAN).
distances: [u32; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Bandwidth[src_node * node_count + dst_node] = bytes/sec.
bandwidth: [u64; MAX_CLUSTER_NODES * MAX_CLUSTER_NODES],
/// Local NVMe 4KB random read latency in nanoseconds, measured at cluster join
/// time by calibrate_local_storage(). Used to decide whether to fetch a page
/// from a remote node or read it from local SSD. Default: 12,000 ns (12 μs) if
/// no local NVMe is present. Updated on storage topology changes.
ssd_cost_ns: u64,
}
impl ClusterDistanceMatrix {
/// Should we fetch a page from a remote node vs. from local SSD?
/// Returns true if remote fetch is expected to be faster.
pub fn prefer_remote_over_ssd(&self, local: NodeId, remote: NodeId) -> bool {
let remote_cost = self.page_cost_ns(local, remote);
// ssd_cost_ns is calibrated at cluster join by calibrate_local_storage()
// (Section 47.11) and updated on storage topology changes; it falls back
// to 12,000 ns (12 μs) if no local NVMe is present.
remote_cost < self.ssd_cost_ns
}
}
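`prefer_remote_over_ssd` relies on `page_cost_ns`, which is not shown in this excerpt. A minimal sketch of the flat-index lookup over a simplified matrix — the field names mirror `ClusterDistanceMatrix`, but the method bodies are assumed for illustration:

```rust
/// Simplified stand-in for ClusterDistanceMatrix, illustrating the
/// distances[src * node_count + dst] row-major lookup. Bodies are an
/// assumed sketch, not the spec's implementation.
pub struct DistanceMatrixSketch {
    node_count: u32,
    distances: Vec<u32>, // row-major: src * node_count + dst
    ssd_cost_ns: u64,
}

impl DistanceMatrixSketch {
    pub fn new(node_count: u32, default_ns: u32, ssd_cost_ns: u64) -> Self {
        Self {
            node_count,
            distances: vec![default_ns; (node_count * node_count) as usize],
            ssd_cost_ns,
        }
    }

    /// Record a measured cost; stored per direction (asymmetric links exist).
    pub fn set_cost(&mut self, src: u32, dst: u32, ns: u32) {
        self.distances[(src * self.node_count + dst) as usize] = ns;
    }

    /// Cost in nanoseconds to move one 4KB page from src to dst.
    pub fn page_cost_ns(&self, src: u32, dst: u32) -> u64 {
        self.distances[(src * self.node_count + dst) as usize] as u64
    }

    /// Remote fetch wins when its measured cost is below the calibrated
    /// local-NVMe read cost.
    pub fn prefer_remote_over_ssd(&self, local: u32, remote: u32) -> bool {
        self.page_cost_ns(local, remote) < self.ssd_cost_ns
    }
}
```

With the default 12 μs NVMe calibration, a same-switch RDMA peer (~4 μs) is preferred over local SSD, while a WAN-distant peer (~20 μs) is not.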
47.2.5 Peer Kernel Isolation and Crash Recovery
This section defines the isolation model for device-local kernels (Section 47.2.2 Path A/B) — devices running ISLE or an ISLE-protocol firmware as a first-class cluster peer. This model is fundamentally different from both Tier 1 driver crash recovery (Section 9) and remote node failure (Section 47.11), and must not be confused with either.
47.2.5.1 The Isolation Model Shift
With traditional drivers (Tier 1/Tier 2), ISLE operates a supervisor hierarchy: the host kernel loads, supervises, and recovers drivers. A Tier 1 driver crash is handled by the kernel — it detects the fault, revokes the domain, issues FLR, and reloads the driver binary. The kernel is always in control.
With a multikernel peer, this hierarchy does not exist. The device runs its own autonomous kernel on its own cores in its own physical memory. The host kernel:
- Does not load the device kernel
- Cannot inspect or modify the device kernel's private memory
- Cannot "reload" the device kernel the way it reloads a Tier 1 driver
- Cannot force the device into a known-good state without a full hardware reset
The isolation unit shifts from software domain (MPK/DACR keys, enforced by the kernel) to physical boundary (PCIe bus, IOMMU, enforced by hardware).
47.2.5.2 Host Unilateral Controls
Despite being a peer, the device is physically attached to the host's PCIe fabric. The host retains six unilateral controls it can exercise regardless of the device kernel's state — even against a buggy, wedged, or hostile peer. They form an escalating ladder from surgical containment to full hardware reset:
1. IOMMU domain lockout — The host programmed the device's IOMMU domain at join time. It can modify or revoke that domain at any time. Revoking the IOMMU domain prevents all further device DMA into host memory, hardware-enforced, with no cooperation from the device kernel. This is the primary containment action.
2. PCIe bus master disable — A single MMIO write to the device's Command register clears the Bus Master Enable bit. All PCIe transactions from the device are silently dropped by the root complex. Effective immediately; no cooperation needed.
3. Function Level Reset (FLR) — The host can issue FLR via the PCIe capability register. This resets all device hardware state: the device kernel dies, all in-flight DMA is cancelled, device cores reset. The device must re-initialize and rejoin the cluster. FLR is the multikernel equivalent of "driver reload" but takes device-reboot time (~1-30 seconds depending on firmware initialization) rather than driver-reload time (~50-150ms for Tier 1).
CXL Type 1/2 note: On CXL-attached peers with CXL.cache (a coherent cache in the host CPU's coherency domain), the CXL specification requires the device to flush all dirty cache lines back to host memory before the FLR completes. The host must not access the shared memory region until the FLR completion status is confirmed — cache lines in flight during FLR may be in an indeterminate state. ISLE's crash recovery sequence (§47.2.5.5) issues IOMMU lockout and bus master disable (steps 1-2) before FLR (step 7) precisely to ensure no new DMA or cache traffic originates from the device during the flush window.
4. PCIe Secondary Bus Reset (SBR) — The upstream bridge or root port can assert the Secondary Bus Reset bit in its Bridge Control register. This propagates a hard reset signal to all devices on that bus segment — more comprehensive than FLR (resets all PCIe functions on the device, not just one) but less disruptive than a full power cycle. Used when FLR is not supported or does not produce a clean reset.
5. PCIe slot power — The host can power-cycle the PCIe slot via the Hot-Plug Slot Control register or via ACPI _PS3/_PS0 platform methods, cutting and then restoring power to the slot entirely. Reserved for devices that do not respond to FLR or SBR.
6. Out-of-band management (BMC/IPMI) — On server-class hardware, the Baseboard Management Controller (BMC) has independent hardware control over PCIe slot power and presence, completely bypassing the host OS. The BMC operates on a separate management plane with its own network interface and power rail. This means that even if the host kernel itself is hung or severely degraded, a management controller can power-cycle misbehaving device slots and restore the host's ability to re-enumerate them. ISLE treats the BMC as a platform capability, not an ISLE component — its availability depends on the hardware platform, and it operates outside ISLE's software control path.
The critical consequence of this escalating ladder is: a heartbeat timeout from a peer device is never a dead end. The host always has a hardware path to reset and re-enumerate the device, independent of device cooperation. The peer kernel model does not introduce any situation where a failed device permanently wedges the host.
These controls are one-sided and non-cooperative. The host can execute them without any message to the device, without the device kernel's acknowledgment, and even while the device kernel is executing arbitrary code. They represent the irreducible minimum of host authority over physically attached devices.
47.2.5.3 Isolation Comparison
Traditional Tier 1 driver:
Same Ring 0 address space as host kernel.
Isolation via MPK/DACR — software-enforced, escape possible via WRPKRU.
Host kernel supervises: detects crash, revokes domain, FLR, reload.
Recovery: ~50-150ms (driver reload).
Can corrupt: own domain memory (not host kernel critical structures).
Multikernel peer (Mode A — message passing):
Separate address space, separate physical DRAM, separate cores.
Isolation via IOMMU — hardware-enforced, physically unreachable.
Host kernel does NOT supervise: detects via heartbeat, acts unilaterally.
Recovery: escalating reset ladder (FLR → SBR → slot power → BMC/IPMI).
At least one reset path always available; heartbeat timeout is never a dead end.
Can corrupt: RDMA pool (IOMMU-bounded, excludes kernel critical structures).
Can leave dirty: distributed state (held locks, DSM page ownership).
Multikernel peer (Mode B — hardware-coherent):
Same as Mode A for isolation (IOMMU still bounds device DMA).
Additional risk: in-flight coherent writes to shared pool may be torn.
Recovery: same as Mode A + pool scan and lock force-release.
Remote cluster node (network-attached):
No physical connection. No unilateral PCIe controls.
Isolation: network boundary, no shared memory at all.
Recovery: membership revocation, DSM page invalidation.
Can corrupt: nothing on host (messages-only interface).
The key asymmetry: a peer kernel crash is more isolated than a Tier 1 driver crash (no shared Ring 0 address space, IOMMU is harder than MPK), but less controlled (no supervisor relationship, no fast reload, possible dirty distributed state).
47.2.5.4 Crash Detection
Two detection paths run in parallel for attached peer kernels:
Primary: Cluster heartbeat (both Mode A and Mode B). The standard membership protocol (Section 47.11): heartbeat every 100ms. Missed for 1 second → Suspect. Missed for 3 seconds → Dead. This covers all failure modes including silent crashes and wedged firmware.
Secondary: MMIO watchdog (required for Mode B; recommended for Mode A). The device firmware writes a monotonically increasing counter to a dedicated MMIO register every 10ms. The host polls this register on any DLM lock acquisition timeout or DSM page request timeout. Stale counter → immediate Suspect escalation. Detection latency: ~20ms instead of up to 1 second via heartbeat alone.
Mode B devices that do not implement the MMIO watchdog must not be granted hardware-coherent transport access. The faster detection is mandatory for Mode B because stale lock state in the coherent pool propagates to all cluster members holding locks in that region.
/// Per-peer crash detection state, maintained by the host kernel.
pub struct PeerKernelHealth {
/// NodeId of the peer kernel.
pub node_id: NodeId,
/// Last observed MMIO watchdog counter value.
pub last_watchdog_count: AtomicU64,
/// Timestamp of last watchdog read (nanoseconds since boot).
pub last_watchdog_ts_ns: AtomicU64,
/// Number of consecutive missed heartbeats.
pub missed_heartbeats: AtomicU32,
/// Current health state.
pub state: AtomicU32, // PeerHealthState enum
/// PCIe BDF for unilateral control operations.
pub pcie_bdf: PcieBdf,
/// Coordination mode (determines recovery procedure).
pub mode: PeerCoordinationMode, // ModeA | ModeB
}
#[repr(u32)]
pub enum PeerHealthState {
Active = 0,
Suspect = 1, // Missed heartbeats or stale watchdog — alert, escalate
Dead = 2, // Confirmed failed — execute recovery sequence
Faulted = 3, // FLR issued, waiting for device to rejoin
}
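The watchdog staleness check can be sketched as follows: a counter that has not advanced even though enough time has elapsed for at least two 10ms firmware ticks escalates the peer to Suspect. Names beyond the struct fields above are assumptions, and std atomics stand in for the kernel's own primitives:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Watchdog interval the device firmware is expected to honor (10ms).
const WATCHDOG_PERIOD_NS: u64 = 10_000_000;

/// Minimal health record mirroring the watchdog-related fields of
/// PeerKernelHealth. The check function is an illustrative sketch.
pub struct WatchdogState {
    pub last_count: AtomicU64,
    pub last_read_ns: AtomicU64,
    pub state: AtomicU32, // 0 = Active, 1 = Suspect (see PeerHealthState)
}

/// Called on a DLM lock or DSM page request timeout: poll the MMIO counter
/// and escalate if stale. `mmio_count` is the value just read from the
/// device register; `now_ns` is nanoseconds since boot.
pub fn check_watchdog(w: &WatchdogState, mmio_count: u64, now_ns: u64) -> bool {
    let prev_count = w.last_count.swap(mmio_count, Ordering::AcqRel);
    let prev_ts = w.last_read_ns.swap(now_ns, Ordering::AcqRel);
    // Stale: counter did not advance although at least two watchdog
    // periods elapsed since the last sample.
    let stale = mmio_count == prev_count && now_ns - prev_ts >= 2 * WATCHDOG_PERIOD_NS;
    if stale {
        w.state.store(1, Ordering::Release); // Suspect
    }
    stale
}
```

Because the check piggybacks on operation timeouts rather than a dedicated poll loop, a healthy peer costs nothing extra; a wedged peer is flagged within ~20ms of the first stalled lock or page request.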
47.2.5.5 Recovery Sequence
On transition to Dead, the host kernel executes the following sequence.
Steps 1-3 are unilateral and non-cooperative. Steps 4-6 involve cleanup of
distributed state. Steps 7-8 are optional and admin-controlled.
Peer kernel crash recovery sequence:
1. IOMMU lockout (immediate, unilateral)
host.iommu.revoke_domain(peer.pcie_bdf)
— All device DMA into host memory blocked in hardware.
— All outstanding RDMA operations targeting this device's QPs are cancelled.
— Estimated time: <1ms (IOMMU domain table update + TLB shootdown).
2. PCIe bus master disable (immediate, unilateral)
host.pcie.clear_bus_master(peer.pcie_bdf)
— Device can no longer initiate any PCIe transactions.
— Defense-in-depth: belts-and-suspenders with IOMMU lockout.
— Estimated time: <1ms.
3. [Mode B only] Pool scan and lock force-release
host.rdma_pool.scan_and_release_locks(peer.node_id)
— Scan all DLM lock entries in the coherent pool owned by the dead peer.
— Force-release each held lock; set tombstone flag (LOCK_OWNER_DEAD).
— Waiters receive LOCK_OWNER_DEAD, not timeout.
— Scan all DSM directory entries owned by peer; mark as LOST.
— Estimated time: O(active_locks) — typically <10ms.
4. Cluster membership revocation
cluster.revoke_membership(peer.node_id)
— Broadcasts MEMBER_DEAD(peer.node_id) to all other cluster nodes.
— All nodes destroy their QP connections to the peer.
— Rkey for the peer's RDMA pool is invalidated.
— Estimated time: ~1 RTT to notify all members (~1-3ms on LAN).
5. DSM page invalidation
dsm.invalidate_all_pages_owned_by(peer.node_id)
— All DSM pages in the Owner or Shared state for the dead peer are
invalidated. Processes mapping these pages receive SIGBUS on next
access (same semantics as NFS server failure).
— Home-node responsibilities for pages hosted on the peer are
migrated to surviving nodes via the migration protocol (Section 47.5.3).
— Estimated time: O(pages_owned_by_peer) — typically <100ms.
6. Capability revocation
cap_table.revoke_all_for_node(peer.node_id)
— All capabilities granted to or from the peer are invalidated.
— Ongoing ring buffer channels to the peer are torn down.
— Estimated time: O(capabilities) — typically <10ms.
7. [Optional] Reset and device reboot (admin-controlled or auto-policy)
Escalating ladder — attempt each step in order, stop when device rejoins:
a. FLR: host.pcie.issue_flr(peer.pcie_bdf)
b. SBR: host.pcie.issue_sbr(peer.pcie_bdf) // if FLR fails or unsupported
c. Slot power cycle: host.pcie.power_cycle(peer.pcie_bdf) // last software resort
d. BMC/IPMI slot power: platform.bmc.power_cycle_slot(peer.slot_id) // OOB, if available
— Resets all device hardware state; device firmware re-initializes.
— If device re-joins cluster: resume normal operation.
— If device fails to re-join within timeout after all steps: mark as Faulted,
notify admin, do not attempt further resets automatically.
— At least one reset path (FLR or SBR) is always available for PCIe devices.
8. [Optional] Workload migration
scheduler.migrate_workloads_from(peer.node_id, policy)
— Workloads that were running on the peer's compute resources
(containers, VMs, scheduled tasks) are either:
a. Migrated to surviving nodes if they were checkpointable.
b. Terminated with SIGKILL if not checkpointable.
c. Suspended pending device recovery if short outage expected.
— Policy configured per-cgroup: migrate | terminate | suspend.
Total time to containment (steps 1-2): < 2ms. Total time to clean state (steps 1-6): < 200ms typical. Total time to full recovery (steps 1-8 with FLR): 1-30 seconds (device-dependent).
Compare to Tier 1 driver reload: 50-150ms total. The peer kernel recovery is slower because FLR + firmware re-initialization cannot be parallelized, and because distributed state cleanup (steps 4-6) is inherently network-speed rather than local-memory-speed. Steps 1-3 (containment) are comparable to or faster than Tier 1 domain revocation.
CXL Type 3 crash — distinct recovery model: When a CXL Type 3 memory-expander peer loses its management processor (crash, firmware fault, or reset), the recovery is fundamentally different from a compute peer crash:
- The DRAM cells do not disappear. The physical memory remains accessible to the host via CXL.mem load/store — it is DRAM on a PCIe/CXL bus, not in the device's compute domain. The host can still read and write those pages.
- The management layer is gone. Tiering decisions, compression metadata, encryption keys (if any), and bad-page tracking maintained by the management processor are no longer available. Compressed pages are unreadable until decompressed; encrypted pages are inaccessible if keys were held in device DRAM.
- Recovery action: The host marks the CXL pool as ManagementFaulted (not Dead). Pages without compression or encryption remain accessible normally. Pages with compression or encryption are migrated or treated as lost. The management processor is reset (FLR → re-initialize firmware); once recovered, it re-scans the pool metadata and rejoins as a memory-manager peer.
- No workload migration needed: There are no workloads running on a Type 3 management processor. Workloads using the CXL pool memory continue running unaffected (if their pages were uncompressed/unencrypted). Only pool management operations (tiering, new allocation) are suspended during recovery.
The PeerKernelHealth state machine (§47.2.5.4) gains an additional state for Type 3 peers: ManagementFaulted — memory accessible, management unavailable.
47.2.5.6 What Survives Peer Kernel Crash Intact
The host kernel's own state is fully protected:
| Component | Protected by | Survives peer crash? |
|---|---|---|
| Host kernel text, rodata | IOMMU (device cannot reach) | Always |
| Host kernel stacks | IOMMU (not in RDMA pool) | Always |
| Capability tables | IOMMU (not in RDMA pool) | Always |
| Scheduler state | IOMMU (not in RDMA pool) | Always |
| Host application memory | IOMMU (not in RDMA pool unless explicitly exported) | Always |
| RDMA pool — kernel structures | IOMMU bounds; pool scan on Mode B crash | Recovered in step 3 |
| RDMA pool — application DSM pages | Invalidated (step 5); app gets SIGBUS | Lost (must re-fetch or re-compute) |
| Other cluster nodes' state | Steps 4-6 clean up distributed references | Recovered |
| Host kernel stability | Nothing to crash | Never affected |
The host kernel never crashes due to a peer kernel crash. The IOMMU is the hardware guarantee; the recovery sequence is the software cleanup.
47.2.5.7 Relationship to Other Failure Handling Sections
- Section 9 (Driver Crash Recovery): Covers Tier 1/Tier 2 driver failures where the host kernel is the supervisor. The peer kernel model is a peer relationship, not a supervisor relationship. Do not apply Section 9 procedures to peer kernels.
- Section 47.11 (Cluster Membership and Failure Detection): Covers remote nodes connected over RDMA networks. Peer kernel recovery shares the membership protocol (steps 4-6) but adds the unilateral PCIe controls (steps 1-3) that are not available for network-attached nodes.
- Section 48.6 (DPU Failure Handling): Covers the §48 offload tier model where the DPU acts as a managed device via the OffloadProxy KABI. If the same DPU runs a full ISLE peer kernel (Section 47.2.2 Path A), this section applies instead. The two models are mutually exclusive for a given device.
47.3 RDMA-Native Transport Layer
47.3.1 Design: Kernel RDMA Transport (isle-rdma)
Unlike Linux where RDMA is a separate subsystem used only by applications, ISLE integrates RDMA into the kernel's transport layer so that any kernel subsystem can use RDMA for data movement.
Linux architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: memcpy, TCP sockets, block I/O │
│ (no RDMA awareness) │
├───────────────────────────────────────────────────┤
│ RDMA subsystem (ib_core, ib_uverbs) │
│ └── Only used by: userspace apps via libibverbs │
│ (kernel subsystems don't use this) │
└───────────────────────────────────────────────────┘
ISLE architecture:
┌───────────────────────────────────────────────────┐
│ Kernel subsystems (MM, VFS, IPC, scheduler) │
│ └── All use: isle-rdma transport (when remote) │
│ Local path: same as before (memcpy, DMA) │
│ Remote path: RDMA read/write via isle-rdma │
├───────────────────────────────────────────────────┤
│ isle-rdma: unified RDMA transport for kernel use │
│ ├── Page migration (MM → remote NUMA node) │
│ ├── Page cache fill (VFS → remote page cache) │
│ ├── IPC ring (IPC → remote process) │
│ ├── Control messages (scheduler, capabilities) │
│ └── Userspace RDMA (libibverbs compat, unchanged) │
├───────────────────────────────────────────────────┤
│ RDMA hardware driver (mlx5, efa, bnxt_re, etc.) │
│ └── Implements RdmaDeviceVTable (Section 46) │
└───────────────────────────────────────────────────┘
47.3.2 Transport Abstraction
// isle-core/src/rdma/transport.rs (kernel-internal)
/// Unified transport for kernel-to-kernel communication.
/// Abstracts over RDMA hardware and provides the primitives that
/// other kernel subsystems (MM, IPC, page cache) use.
pub struct KernelTransport {
/// RDMA device used for kernel communication.
rdma_device: DeviceNodeId,
/// Protection domain for all kernel RDMA operations.
kernel_pd: RdmaPdHandle,
/// Pre-registered memory regions for fast page transfer.
/// Covers designated RDMA-eligible regions (default: 25% of RAM).
/// Avoids per-transfer memory registration overhead.
kernel_mr: RdmaMrHandle,
/// Per-remote-node connections (reliable connected QPs).
/// O(1) lookup indexed by NodeId. With MAX_CLUSTER_NODES = 64, the array
/// occupies 64 * size_of::<Option<NodeConnection>>() bytes.
connections: [Option<NodeConnection>; MAX_CLUSTER_NODES],
/// Per-CPU completion queues, one per logical CPU.
/// A single shared CQ becomes a bottleneck under load because all CPUs
/// contend to drain it. Per-CPU CQs allow each CPU's RDMA poll thread
/// to drain its own queue independently, eliminating cross-CPU contention.
/// Each QP in NodeConnection is created against the CQ of the CPU that
/// owns that connection's polling thread (assigned at connection setup time).
/// Indexed by cpu_id; length = num_online_cpus() at transport init.
/// Uses a slab-allocated slice rather than a fixed array: the kernel has no
/// compile-time MAX_CPUS (Section 14.1) — CPU count is runtime-discovered.
/// Allocated once at transport init, never resized on the hot path.
cqs: SlabSlice<RdmaCqHandle>,
/// Statistics.
stats: TransportStats,
}
pub struct NodeConnection {
/// Reliable connected queue pair for control messages.
control_qp: RdmaQpHandle,
/// Reliable connected QP for one-sided RDMA (Read/Write/Atomic).
/// RDMA Read/Write requires RC or UC transport; UD only supports Send/Receive.
data_qp: RdmaQpHandle,
/// Remote node's memory region key (for RDMA read/write).
remote_rkey: u32,
/// Remote node's base address for page transfers.
remote_base_addr: u64,
/// Flow control: outstanding RDMA operations.
inflight: AtomicU32,
/// Maximum concurrent RDMA operations to this node.
max_inflight: u32,
}
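The `inflight`/`max_inflight` pair implements a simple credit scheme: a work request is posted only if a slot can be reserved first, and the slot is returned when the completion queue reports the operation done. A sketch of that reserve/release pattern — the type and method names are assumptions, and std atomics stand in for the kernel's own:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Flow-control slice of NodeConnection: bounds outstanding RDMA
/// operations to a single remote node.
pub struct FlowControl {
    inflight: AtomicU32,
    max_inflight: u32,
}

impl FlowControl {
    pub fn new(max_inflight: u32) -> Self {
        Self { inflight: AtomicU32::new(0), max_inflight }
    }

    /// Try to reserve a slot before posting a work request. Returns false
    /// when the connection is at its cap (caller backs off or queues).
    pub fn try_acquire(&self) -> bool {
        let mut cur = self.inflight.load(Ordering::Relaxed);
        loop {
            if cur >= self.max_inflight {
                return false;
            }
            match self.inflight.compare_exchange_weak(
                cur, cur + 1, Ordering::Acquire, Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    /// Release a slot when the completion queue reports the op finished.
    pub fn release(&self) {
        self.inflight.fetch_sub(1, Ordering::Release);
    }
}
```

The compare-exchange loop (rather than an unconditional `fetch_add`) guarantees the counter never exceeds `max_inflight`, so the send queue can never be overrun even under concurrent posters.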
/// Operations available to kernel subsystems.
impl KernelTransport {
/// RDMA Read: fetch a page from remote node to local memory.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page fault on remote page), page cache (remote fill).
pub fn fetch_page(
&self,
remote_node: NodeId,
remote_phys_addr: u64,
local_page: PhysAddr,
size: u32, // Usually 4096 (4KB) or 2097152 (2MB huge page)
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Write: push a page from local memory to remote node.
/// One-sided — no remote CPU involvement.
/// Used by: MM (page eviction to remote node), DSM (writeback).
pub fn push_page(
&self,
remote_node: NodeId,
local_page: PhysAddr,
remote_phys_addr: u64,
size: u32,
) -> Result<RdmaCompletion, TransportError>;
/// RDMA Atomic Compare-and-Swap (64-bit).
/// Used by: Distributed Lock Manager (Section 31a) uncontested acquire,
/// DSM page ownership transfers.
pub fn atomic_cas(
&self,
remote_node: NodeId,
remote_addr: u64,
expected: u64,
desired: u64,
) -> Result<u64, TransportError>;
/// Send a control message to remote node (two-sided, requires remote CPU).
/// Used by: cluster management, capability distribution, scheduler.
pub fn send_control(
&self,
remote_node: NodeId,
msg: &ControlMessage,
) -> Result<(), TransportError>;
/// Batch page transfer: move N pages in a single RDMA operation.
/// Used by: bulk migration, prefetch, process migration.
pub fn fetch_pages_batch(
&self,
remote_node: NodeId,
pages: &[(u64, PhysAddr)], // (remote_phys, local_phys) pairs
) -> Result<RdmaBatchCompletion, TransportError>;
}
47.3.3 Pre-Registered Kernel Memory
The biggest performance problem with Linux RDMA is memory registration. Before any RDMA operation, memory must be pinned and registered with the NIC hardware — translated to physical addresses and programmed into the NIC's address translation table. This costs ~1-10 μs per registration.
ISLE avoids per-transfer registration overhead by pre-registering designated RDMA-eligible memory regions at cluster join time:
Boot sequence:
1. RDMA NIC driver initializes (standard KABI init).
2. Cluster join is requested.
3. isle-rdma registers RDMA-eligible memory regions:
- A configurable pool (default: 25% of RAM) for DSM pages and RDMA buffers
- Pool size set via /sys/kernel/isle/cluster/rdma_pool_percent (1-75%)
- Single memory region per pool, one-time cost (~100 μs)
- NIC can DMA to/from registered pages without per-transfer registration
- Non-registered memory cannot be accessed by remote nodes
4. Remote nodes exchange rkeys (remote access keys).
5. Any kernel subsystem can now RDMA read/write registered pages with zero
setup cost. Pages outside the RDMA pool are not remotely accessible.
This is safe because:
- Only designated RDMA-eligible regions are registered (not all physical memory)
- The rkey is only shared with authenticated cluster members
- RDMA access is gated by the connection (reliable connected QP)
- The kernel controls which pages are exported (via page ownership tracking)
- IOMMU still validates all DMA (RDMA NIC goes through IOMMU like any device)
- Kernel text, kernel stacks, and security-sensitive structures are never in
the RDMA pool
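The pool sizing rule above (a percentage of RAM, clamped to 1-75%, default 25%) reduces to simple arithmetic. A sketch — the function name and the page-alignment choice are illustrative assumptions:

```rust
/// Sketch: compute the RDMA-eligible pool size from total RAM and the
/// /sys/kernel/isle/cluster/rdma_pool_percent setting (clamped to 1-75).
/// The result is rounded down to a 4KB page boundary, since registration
/// operates on whole pages.
pub fn rdma_pool_bytes(total_ram_bytes: u64, percent: u64) -> u64 {
    const PAGE_SIZE: u64 = 4096;
    let pct = percent.clamp(1, 75);
    let raw = total_ram_bytes / 100 * pct;
    raw / PAGE_SIZE * PAGE_SIZE
}
```

Dividing by 100 before multiplying avoids u64 overflow for any realistic RAM size; the sub-1% rounding loss is irrelevant at pool granularity.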
47.3.4 Performance Characteristics
| Operation | Latency | Bandwidth | CPU Involvement |
|---|---|---|---|
| RDMA Read (4KB page) | ~2-3 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Write (4KB page) | ~1-2 μs | ~200 Gb/s line rate | Zero on remote side |
| RDMA Atomic CAS | ~2-3 μs | N/A | Zero on remote side |
| Control message (send/recv) | ~2-3 μs | N/A | Interrupt on remote side |
| Batch page transfer (64 pages) | ~5-10 μs | Near line rate | Zero on remote side |
| Memory registration (avoided) | 0 (pre-registered) | N/A | N/A |
Compare with alternatives:

| Alternative | 4KB Fetch Latency | Notes |
|------------|-------------------|-------|
| Local DRAM (same NUMA) | ~300-500 ns | Baseline (4KB page copy) |
| Local DRAM (cross-NUMA) | ~500-800 ns | QPI/UPI hop (4KB page copy) |
| RDMA (same switch) | ~3 μs | ~6-10x local DRAM, but... |
| CXL 2.0 pooled memory | ~200-400 ns | Hardware-coherent |
| NVMe SSD | ~10-15 μs | Current swap target |
| NFS/CIFS (TCP) | ~50-200 μs | Kernel networking overhead |
| HDD | ~5,000 μs | Rotational latency |
Key insight: Raw RDMA Read latency (~3 μs) is 3-5x faster than NVMe (~10-15 μs). However, the complete DSM page fault path (directory lookup + ownership negotiation + RDMA transfer) totals ~10-18 μs (see Section 47.5.4), which is comparable to NVMe latency rather than 3-5x faster. The advantage of remote DRAM over NVMe is not single-fault latency but bandwidth scalability: RDMA provides ~25 GB/s per port (200 Gb/s InfiniBand) vs. ~7 GB/s for a single NVMe SSD, and multiple RDMA ports can be aggregated. Remote memory is a better "swap" target than local disk for bandwidth-bound workloads.
47.3.5 Security Considerations
The pre-registered RDMA pool approach (Section 47.3.3) creates a security trade-off: any node with the rkey can read/write any address within the RDMA-eligible region on the remote node via one-sided RDMA. The attack surface is limited to the RDMA pool (default 25% of RAM), not all physical memory — kernel text, stacks, and security structures are excluded.
Explicit trust model assumption: All nodes in an ISLE cluster are mutually trusted kernel instances, authenticated during cluster join (X25519 key exchange authenticated with Ed25519 signatures, Section 47.2.3). A compromised node can read/write the RDMA pool of any other node. This is the same trust model used by all production InfiniBand and RoCE deployments today. The cluster join authentication is the trust boundary — once joined, nodes are peers. If a node is suspected of compromise, its cluster membership is revoked (Section 47.11), destroying all QP connections and invalidating its RDMA mappings immediately.
Mitigation 1: RDMA Memory Windows (MW Type 2) — For security-sensitive or multi-tenant deployments, use RDMA Memory Windows instead of pool-wide registration. Each page export creates a short-lived memory window with a unique rkey, scoped to the specific page or region being transferred. The window is revoked when page ownership changes. This adds ~0.5-3μs overhead per window create/destroy (HCA-dependent; ConnectX-5/6 MW Type 2 bind/invalidate involves firmware interaction and PCIe round-trips) but eliminates pool-wide exposure. A compromised node can only access pages for which it currently holds a valid memory window, not the entire 25% pool.
Mitigation 2: Rkey rotation — In trusted mode, the pool-wide rkey is rotated periodically (default: every 60 seconds). All nodes exchange new rkeys via the authenticated control channel. Old rkeys are invalidated after a grace period (2x rotation interval = 120s). This limits the window of exposure if an rkey is leaked outside the cluster: a revoked node retains stale access for at most 180 seconds (rotation interval + grace period). For deployments where this window is unacceptable (multi-tenant, zero-trust), use Mitigation 1 (RDMA Memory Windows) instead, which provides immediate per-page revocation at the cost of higher per-operation overhead.
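The exposure bound quoted above follows directly from the rotation parameters. A sketch of the arithmetic (the function name is an assumption):

```rust
/// Sketch: worst-case stale-rkey exposure window under rotation.
/// An rkey leaked immediately after issuance remains current for up to one
/// full rotation interval, and the old key is still accepted for the grace
/// period after that. Defaults: 60s rotation, 2x grace → 60 + 120 = 180s.
pub fn max_stale_access_secs(rotation_interval_secs: u64, grace_multiple: u64) -> u64 {
    let grace_period_secs = rotation_interval_secs * grace_multiple;
    rotation_interval_secs + grace_period_secs
}
```

Shortening the rotation interval tightens the bound linearly, at the cost of more frequent rkey exchanges over the control channel.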
Mitigation 3: Trusted cluster mode (default) — For single-tenant, physically secured clusters (the common datacenter case), the pool-wide registration with rkey rotation is the default. This matches current InfiniBand practice — all production RDMA deployments assume a trusted fabric.
Mitigation 4: Hardware memory encryption — Note: Total Memory Encryption (Intel TME-MK, AMD SME) operates at the memory controller boundary and does NOT encrypt data visible to DMA devices. RDMA NICs access memory through the IOMMU in the CPU's coherency domain and see plaintext. TME-MK/SME protects against physical DRAM extraction attacks (cold boot) but does not provide defense-in-depth against software-based RDMA attacks. For RDMA data-in-transit protection, use the AES-GCM encryption described above.
Acknowledged limitation: All four mitigations have trade-offs. MW Type 2 adds latency to every page transfer. Rkey rotation adds periodic coordination overhead. Trusted cluster mode requires a physically secured fabric. Hardware encryption prevents cross-node data snooping but doesn't prevent a compromised node from writing garbage to remote memory. A fully zero-trust RDMA solution would require per-operation cryptographic MACs. For RoCE (RDMA over Converged Ethernet), MACsec (IEEE 802.1AE) provides link-layer encryption and integrity at the Ethernet layer, protecting against physical tapping and injection. For native InfiniBand, MACsec is not applicable (IB does not use Ethernet framing); InfiniBand relies on partition-based isolation (P_Key), queue-pair key authentication (Q_Key), and subnet manager access control for security. Neither MACsec nor InfiniBand's native mechanisms provide per-operation cryptographic authentication for one-sided RDMA operations at line rate. ISLE's pragmatic stance: match existing InfiniBand/RoCE security practice (trusted fabric with partition isolation), offer MW-based restriction for multi-tenant or security-sensitive deployments, and upgrade to hardware-enforced per-operation authentication when NIC vendors deliver it (CXL 3.0's integrity model is a likely candidate).
One-sided RDMA authentication: There is a fundamental tension between one-sided RDMA (which requires no remote CPU involvement) and per-operation authentication (which requires CPU processing). The resolution: the trust boundary is the cluster join, authenticated via X25519 key exchange with Ed25519 signatures (Section 47.2.3). Within the cluster, nodes are mutually trusted for RDMA operations. If a node is compromised, its cluster membership is revoked — all QP connections to that node are destroyed and its RDMA mappings are invalidated.
This matches the security model of all existing RDMA deployments: InfiniBand assumes a trusted fabric, and RoCEv2 relies on network isolation (VLANs, VXLANs) for multi-tenant separation.
Default: Trusted mode (pool-wide registration). This matches existing InfiniBand/RoCE practice — all production RDMA deployments assume a trusted fabric. Set via:
/sys/kernel/isle/cluster/security_mode
# "trusted" (default) or "secure"
Why trusted mode is the default: The security boundary in an ISLE cluster is the cluster join authentication (Section 47.2.3: X25519 key exchange authenticated with Ed25519 signatures). Once a node is authenticated into the cluster, it is a trusted peer. RDMA pool-wide access does not expand the trust boundary — a compromised node already has full access to cluster resources through authenticated channels.
Datacenter environments provide additional isolation layers:
- Physical access control (racks, cages, badge readers)
- Network isolation (VLANs, VXLANs, dedicated RDMA fabrics)
- Node attestation during cluster join
For these environments, the ~5% memory-window overhead is unnecessary. Secure mode is appropriate for:
- Multi-tenant cloud environments where different customers share the RDMA fabric
- Environments without physical network isolation
- Defense-in-depth deployments where the threat model includes node compromise despite authentication
Switching to secure mode:
echo "secure" > /sys/kernel/isle/cluster/security_mode
# Note: This triggers a brief (~100ms) disruption as QPs are reconfigured
47.4 Distributed IPC
47.4.1 Extending Ring Buffers to RDMA
ISLE's IPC is built on MPSC ring buffers (Section 8, Zero-Copy I/O Path, which defines the MPSC ring buffer protocol), using SQE/CQE structures compatible with io_uring. The same ring buffer protocol works over RDMA.
Local IPC (current design, unchanged):
┌──────────┐ shared memory ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
└──────────┘ (mapped pages) └──────────┘
Same machine, same address space region.
Zero-copy. Latency: ~200ns.
Distributed IPC (new):
┌──────────┐ RDMA ring ┌──────────┐
│ Process A │ ──── SQE/CQE ────► │ Process B │
│ (Node 0) │ (RDMA write) │ (Node 1) │
└──────────┘ └──────────┘
Different machines, RDMA-connected.
Zero-copy (RDMA, no kernel networking stack). Latency: ~2-3 μs.
The ring buffer protocol (SQE format, CQE format, head/tail pointers,
memory ordering) is identical. Only the transport changes.
47.4.2 Transparent Transport Selection
// isle-core/src/ipc/ring.rs (extend existing)
pub enum RingTransport {
/// Both endpoints on the same machine.
/// Ring is in shared memory (mapped into both processes).
SharedMemory {
ring_pages: PageRange,
},
/// Endpoints on different machines.
/// Submissions: producer RDMA-writes SQEs into consumer's ring memory.
/// Completions: consumer RDMA-writes CQEs back.
/// Doorbell: RDMA send (small control message) notifies consumer.
Rdma {
remote_node: NodeId,
remote_ring_addr: u64,
remote_ring_rkey: u32,
local_ring_pages: PageRange,
doorbell_qp: RdmaQpHandle,
},
/// Endpoints connected via CXL fabric (load/store accessible).
/// Ring is in CXL-attached shared memory. Same as SharedMemory
/// but pages are on a CXL memory pool node.
CxlSharedMemory {
ring_pages: PageRange,
cxl_node: NumaNodeId,
},
}
Transport selection is automatic:
Process A calls connect_ipc(target_process_id):
1. Kernel looks up target process location:
- Same machine → SharedMemory transport
- Remote machine with RDMA → Rdma transport
- CXL-connected pool → CxlSharedMemory transport
2. Kernel sets up ring buffer with appropriate transport
3. Process A gets back an IPC handle (opaque)
4. Process A uses the same SQE submission interface regardless of transport
5. The SQE/CQE format is identical in all cases
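The step-1 decision can be sketched as a pure function. This is an illustrative sketch, not the ISLE kernel API — `ProcessLocation`, `TransportKind`, and `select_transport` are hypothetical names mirroring the `RingTransport` variants above.

```rust
// Illustrative sketch of the step-1 decision; ProcessLocation, TransportKind,
// and select_transport are hypothetical names mirroring RingTransport above.

#[derive(Debug, PartialEq)]
enum ProcessLocation {
    LocalMachine,
    RemoteRdma { node: u32 },
    CxlPool { numa_node: u32 },
}

#[derive(Debug, PartialEq)]
enum TransportKind {
    SharedMemory,
    Rdma { remote_node: u32 },
    CxlSharedMemory { cxl_node: u32 },
}

/// Same machine -> shared memory; remote with RDMA -> RDMA ring;
/// CXL-connected pool -> CXL shared memory. The caller never sees this:
/// the SQE/CQE interface is identical for every variant.
fn select_transport(target: &ProcessLocation) -> TransportKind {
    match target {
        ProcessLocation::LocalMachine => TransportKind::SharedMemory,
        ProcessLocation::RemoteRdma { node } => TransportKind::Rdma { remote_node: *node },
        ProcessLocation::CxlPool { numa_node } => {
            TransportKind::CxlSharedMemory { cxl_node: *numa_node }
        }
    }
}

fn main() {
    assert_eq!(
        select_transport(&ProcessLocation::RemoteRdma { node: 3 }),
        TransportKind::Rdma { remote_node: 3 }
    );
    println!("transport selection ok");
}
```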
47.4.3 Ring Buffer RDMA Protocol
RDMA Ring Buffer Header Extension
The RDMA transport extends the base DomainRingBuffer header (Section 8a.2) with
additional fields for cross-node synchronization. These fields are placed in a
separate RdmaRingHeader that precedes the standard header:
/// RDMA-specific ring buffer header extension.
/// Placed immediately before the DomainRingBuffer in RDMA transport mode.
/// Consumer-owned fields are on a separate cache line from producer-owned fields.
#[repr(C, align(64))]
pub struct RdmaRingHeader {
// === Producer-owned cache line (written by remote producer via RDMA) ===
/// Sequence number of the last published SQE.
/// The producer increments this AFTER writing SQE data to the ring.
/// The doorbell message carries this sequence number.
/// Memory ordering: producer uses Release, consumer uses Acquire.
pub producer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_producer: [u8; 56],
// === Consumer-owned cache line (written locally) ===
/// Sequence number up to which the consumer has processed.
/// Used for flow control and to detect dropped entries.
pub consumer_seq: AtomicU64,
/// Reserved for future use (cache line padding).
_pad_consumer: [u8; 56],
}
Protocol: Sequence-Based Doorbell Synchronization
The core challenge is that RDMA Write and RDMA Send operations may arrive at the consumer out of order — the doorbell (RDMA Send) could arrive before the SQE data (RDMA Write) is visible in memory. To solve this, we use a sequence counter that the consumer polls to determine data readiness:
Producer (Node A) submits work:
1. Write SQE to local staging buffer
2. RDMA Write: push SQE to consumer's ring memory on Node B
(one-sided, no CPU involvement on Node B)
3. RDMA Write: update producer_seq in consumer's RdmaRingHeader.
The new value is (previous_seq + 1).
On RC (Reliable Connection) QPs, RDMA Writes are guaranteed to be delivered
and executed in posting order at the responder. Therefore, the SQE data
(step 2) is always visible before producer_seq (step 3), without
requiring any additional fencing.
Note: IBV_SEND_FENCE is NOT needed here — it only orders operations after
prior RDMA Reads and Atomics, not after prior Writes.
(Each producer has a dedicated QP per consumer, so per-QP ordering suffices.)
4. If consumer was idle: RDMA Send doorbell with payload = new producer_seq value.
The doorbell serves only as a notification to wake the consumer; the consumer
does NOT trust the doorbell's sequence value directly. Instead, the consumer
reads producer_seq from local memory (where it was written by step 3) with
Acquire ordering to establish the happens-before relationship.
Consumer (Node B) processes work:
1. Receive doorbell notification (RDMA Send) OR poll timeout
2. Read producer_seq with Acquire ordering from local RdmaRingHeader
3. Compare producer_seq to consumer_seq to determine how many entries are ready:
ready_count = producer_seq - consumer_seq
4. For each ready entry:
a. Read SQE from local ring memory (RDMA write from step 2 already placed it there)
b. Process request
c. Write CQE to local completion ring
5. Update consumer_seq with Release ordering after processing
6. If completions generated: RDMA Write CQEs back to producer's ring on Node A
Latency breakdown:
RDMA Write (SQE, 64 bytes): ~1 μs
RDMA Write (producer_seq): ~0.5 μs (RC QP guarantees Write-after-Write ordering;
no IBV_SEND_FENCE needed — see note above)
RDMA Send (doorbell): ~0.5 μs (may arrive before or after step 3;
consumer uses seq polling to handle either order)
Consumer processing: application-dependent
RDMA Write (CQE, 32 bytes): ~1 μs
Total overhead: ~3 μs round-trip (plus processing time; +0.5 μs vs. non-seq approach
for the producer_seq write, but eliminates the race condition)
Why sequence-based synchronization is required:
RDMA Send and RDMA Write are different verb types. While RDMA Writes are ordered within a single QP, the relationship between RDMA Send and RDMA Write is not guaranteed by the InfiniBand specification. A naive protocol that sends the doorbell after writing SQE data could observe:
Timeline (broken protocol):
Producer: Write SQE --(network)--> Consumer receives SQE in memory
Producer: Send doorbell --(network)--> Consumer receives doorbell interrupt
If the doorbell packet arrives first (different routing, less data to transfer),
the consumer's interrupt handler reads the ring before the SQE RDMA Write has
arrived — it sees stale or uninitialized data.
The sequence counter solves this by making the data-visibility indication itself travel via RDMA Write (which is ordered with respect to the SQE data writes). The doorbell is merely an optimization to avoid spinning; correctness depends only on the producer_seq field, which the consumer reads with Acquire ordering.
Memory ordering semantics:
| Operation | Ordering | Rationale |
|---|---|---|
| Producer writes SQE data | Relaxed | No ordering requirement until seq is published |
| Producer writes producer_seq | Release | Ensures SQE data is visible before seq advances |
| Consumer reads producer_seq | Acquire | Ensures seq read happens before SQE data read |
| Consumer writes consumer_seq | Release | Ensures SQE processing completes before ack |
| Producer reads consumer_seq | Acquire | Ensures ack read happens before reusing slots |
On x86-64, Release/Acquire compile to plain MOV instructions (TSO provides the required ordering). On AArch64, RISC-V, and PowerPC, the compiler emits the appropriate barriers (STLR/LDAR, fence-qualified atomics, lwsync/isync).
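The ordering rules above can be exercised on a single machine with standard atomics, using two threads as stand-ins for the two nodes (on real hardware the Release store is an RDMA Write that the RC QP orders after the data write). A minimal sketch; `Ring`, `producer`, and `consumer` are illustrative names, and the 4-slot ring elides wrap-around flow control.

```rust
// Single-machine analogue of the sequence-based doorbell protocol.
// Two threads stand in for the two nodes; the Release store on producer_seq
// plays the role of the ordered RDMA Write in step 3.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct Ring {
    slots: [AtomicU64; 4],   // stand-in for the SQE slots
    producer_seq: AtomicU64, // last published SQE (producer-owned)
    consumer_seq: AtomicU64, // last processed SQE (consumer-owned)
}

/// Producer: write SQE data (Relaxed), then publish the sequence (Release).
fn producer(ring: &Ring, sqe: u64) {
    let seq = ring.producer_seq.load(Ordering::Relaxed);
    ring.slots[(seq % 4) as usize].store(sqe, Ordering::Relaxed); // steps 1-2
    ring.producer_seq.store(seq + 1, Ordering::Release);          // step 3
}

/// Consumer: Acquire-read producer_seq; the data is then guaranteed visible.
fn consumer(ring: &Ring) -> Option<u64> {
    let prod = ring.producer_seq.load(Ordering::Acquire); // step 2
    let cons = ring.consumer_seq.load(Ordering::Relaxed);
    if prod == cons {
        return None; // ready_count == 0
    }
    let sqe = ring.slots[(cons % 4) as usize].load(Ordering::Relaxed); // step 4a
    ring.consumer_seq.store(cons + 1, Ordering::Release);              // step 5
    Some(sqe)
}

fn main() {
    let ring = Arc::new(Ring {
        slots: [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)],
        producer_seq: AtomicU64::new(0),
        consumer_seq: AtomicU64::new(0),
    });
    let p = Arc::clone(&ring);
    thread::spawn(move || producer(&p, 42)).join().unwrap();
    assert_eq!(consumer(&ring), Some(42));
    assert_eq!(consumer(&ring), None);
    println!("sequence protocol ok");
}
```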
47.4.4 Batching and Coalescing
For high-throughput scenarios (e.g., database replication, event streaming):
/// Batch submission: write multiple SQEs in a single RDMA operation.
/// Amortizes RDMA overhead across N entries.
pub struct RdmaBatchSubmit {
/// Number of SQEs to submit in this batch.
pub count: u32,
/// Maximum time to wait for batch to fill before flushing (μs).
pub max_coalesce_us: u32,
/// Minimum batch size before flushing.
pub min_batch_size: u32,
}
// With batching:
// Single SQE: ~3 μs overhead per entry
// Batch of 64 SQEs: ~5 μs total = ~78 ns per entry (38x improvement)
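The arithmetic behind the comment above, as a small helper (the 3 μs and 5 μs figures are the section's rough estimates, not measurements):

```rust
// Fixed RDMA overhead amortized across a batch: the per-entry cost is the
// batch's total cost divided by its size.

/// Per-entry overhead in nanoseconds when a batch costs `total_batch_ns`.
fn per_entry_ns(total_batch_ns: u64, batch_size: u64) -> u64 {
    total_batch_ns / batch_size
}

fn main() {
    let single = per_entry_ns(3_000, 1);   // lone SQE: ~3 us of RDMA overhead
    let batched = per_entry_ns(5_000, 64); // 64-SQE batch: ~5 us total
    assert_eq!(batched, 78);               // ~78 ns per entry
    assert_eq!(single / batched, 38);      // the "38x improvement" above
    println!("per-entry overhead: {} ns", batched);
}
```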
47.5 Distributed Shared Memory
47.5.1 Design Overview
Distributed Shared Memory (DSM) allows processes on different nodes to share a virtual address space. Pages migrate between nodes on demand, using RDMA for transport and page faults for coherence.
Why previous DSM projects failed and why this one can work:
| Past Problem | Our Solution |
|---|---|
| Cache-line coherence (64B) over network | Page-level coherence (4KB minimum) |
| Bolted onto Linux MM (invasive patches) | Designed into ISLE MM from start |
| No hardware fault mechanism for remote | RDMA + CPU page fault = standard demand paging |
| Software TLB shootdown over network | Targeted TLB shootdown via RDMA notification (only invalidating nodes listed in the sharer set) |
| Single writer protocol (slow) | Multiple-reader / single-writer with RDMA invalidation |
| No topology awareness | Cluster distance matrix drives all placement decisions |
| Catastrophic on node failure | Capability-based revocation + replicated directory |
47.5.2 Page Ownership Model
Every shared page has exactly one owner node and zero or more reader nodes:
// isle-core/src/dsm/ownership.rs
/// Ownership state of a distributed shared page.
#[repr(u8)]
pub enum DsmPageState {
/// Page is exclusively owned by this node. No remote copies exist.
/// This node can read and write freely.
Exclusive = 0,
/// Page is owned by this node, but read-only copies exist on other nodes.
/// To write: must first invalidate all reader copies.
SharedOwner = 1,
/// This node has a read-only copy. Owner is elsewhere.
/// To write: must request ownership transfer from current owner.
SharedReader = 2,
/// Page is not present on this node. Owner is elsewhere.
/// To read or write: fault → request page from owner via RDMA.
NotPresent = 3,
/// Page is being transferred (migration in progress).
Migrating = 4,
/// Ownership transfer is in progress: the directory entry has been updated to
/// reflect the new owner, but invalidations to current readers have not yet
/// completed. Nodes that read the directory entry during this window see
/// `Invalidating` and must spin-wait (via the seqlock protocol) until the
/// state transitions to `Exclusive`, indicating all invalidations are acked.
Invalidating = 5,
}
/// Directory entry for a distributed shared page.
/// Stored on the home node (determined by hash of virtual address).
#[repr(C)]
pub struct DsmDirectoryEntry {
/// Sequence counter for local CPU consistency on the home node.
/// DsmDirectoryEntry is larger than 8 bytes, so concurrent reads/writes on the
/// home node's CPU require a seqlock.
/// Protocol (home node CPU only — NOT for RDMA):
/// Writer (equivalent to Linux `write_seqlock()` / `write_sequnlock()`):
/// 1. Spin-wait until `sequence` is even (unlocked).
/// 2. CAS(sequence, even_value, even_value + 1, Acquire) — acquires exclusive
/// access AND starts the seqlock write section (makes sequence odd). If
/// CAS fails (another writer won the race or a concurrent increment), retry
/// from step 1. The CAS provides mutual exclusion: only one writer can
/// transition a given even value to odd.
/// 3. Modify fields.
/// 4. store(sequence, even_value + 2, Release) — releases exclusive access AND
/// completes the seqlock write section (makes sequence even, advanced by 2
/// total from the original value). The Release ordering ensures all field
/// updates are visible before the sequence becomes even ("consistent").
/// The CAS in step 2 provides spinlock semantics (mutual exclusion among writers).
/// The even→odd→even transitions provide seqlock semantics for readers. No
/// separate spinlock field is needed.
/// Reader: read sequence (Acquire) → read fields → re-read sequence (Acquire);
/// retry if mismatch/odd.
///
/// **Writer serialization**: The CAS-based seqlock writer protocol (step 2 above)
/// inherently provides mutual exclusion — only one writer can succeed at the
/// CAS(even → odd) transition. No separate per-entry spinlock is needed. This is
/// equivalent to Linux's `write_seqlock()` which combines the spinlock acquisition
/// and sequence increment into a single atomic operation. The CAS uses Acquire
/// ordering to prevent subsequent field writes from being reordered before the
/// sequence becomes odd. The final store uses Release ordering to ensure all field
/// updates are visible before the sequence becomes even.
///
/// Remote nodes do NOT read directory entries via one-sided RDMA Read — a seqlock
/// cannot work across separate RDMA operations (no atomicity between reads).
/// Instead, remote nodes send a two-sided directory lookup request to the home
/// node (see Section 47.5.4 page fault flow), and the home node's CPU reads the
/// entry locally (where the seqlock works correctly) and returns it.
pub sequence: AtomicU64,
/// Current owner of this page.
pub owner: NodeId,
/// Nodes that have read-only copies.
/// Bitfield: bit N = node N has a copy.
/// Supports up to MAX_CLUSTER_NODES (64) nodes; a u64 bitmask enables
/// efficient set operations (add/remove reader in O(1) via bit set/clear).
pub readers: u64,
/// Page state on the owner node.
pub state: DsmPageState,
/// Version counter (incremented on every ownership transfer).
/// Used for consistency checks and stale-copy detection.
pub version: u64,
/// Membership epoch when this entry's home assignment was last computed.
/// When cluster membership changes (node join/leave), the modular hash
/// assignment is recomputed and directory entries are reassigned to new home nodes.
/// This field records the epoch of the last rebalance so that stale entries
/// from a previous epoch can be detected and re-homed.
pub rehash_epoch: u32,
/// Wait queue for blocking on `Invalidating` → `Exclusive` transitions.
/// Request handlers that encounter `state == Invalidating` block here instead
/// of spin-waiting. The wait queue is woken when the invalidation completes.
/// This field is NOT transmitted over RDMA — it is local to the home node's
/// kernel memory and does not affect the wire format of directory entries.
/// Size: ~16-32 bytes depending on architecture (typically a spinlock + list head).
pub wait_queue: WaitQueueHead,
}
Note: The `u64` bitfield limits clusters to `MAX_CLUSTER_NODES` (64) nodes, matching the design constant defined in Section 47.2.4. This limit is chosen because a u64 bitmask enables O(1) reader set operations (add, remove, membership test via bit manipulation) and fits in a single atomic operation. For larger clusters, extend to `[u64; N]` or use a compact set representation. The 64-node limit covers the target deployment (rack-scale RDMA clusters). Extension is deferred (see Section 47.14.6).

RDMA pool constraint: DSM pages are allocated from the RDMA-registered memory pool (Section 47.3.3), which defaults to 25% of physical RAM (configurable via `/sys/kernel/isle/cluster/rdma_pool_percent`, range 1-75%). Only pages within the RDMA pool can be transferred to or fetched from remote nodes. If the pool is exhausted, remote page faults block until pages are freed via LRU-based eviction (the DSM eviction policy reclaims the least-recently-accessed remote-resident pages first). The 25% default is configurable per deployment; workloads with large distributed working sets should increase the pool size accordingly.
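The CAS-based seqlock writer protocol from the `DsmDirectoryEntry` doc comment can be exercised on a single node. A sketch in safe Rust: the entry fields are stored as atomics accessed with Relaxed ordering so the example compiles without `unsafe`; all cross-field ordering is carried by the `sequence` field, as in the protocol above. `Entry`, `write_entry`, and `read_entry` are illustrative names.

```rust
// Single-node sketch of the CAS-based seqlock protocol on the sequence field.
use std::sync::atomic::{AtomicU64, Ordering};

struct Entry {
    sequence: AtomicU64, // even = consistent/unlocked, odd = write in progress
    owner: AtomicU64,    // stand-in for the NodeId field
    version: AtomicU64,
}

/// Writer: spin for even, CAS(even -> odd, Acquire), mutate, store even+2 (Release).
fn write_entry(e: &Entry, new_owner: u64) {
    loop {
        let seq = e.sequence.load(Ordering::Relaxed);
        if seq % 2 != 0 {
            std::hint::spin_loop(); // step 1: another writer holds the entry
            continue;
        }
        // Step 2: only one writer can win this even -> odd transition.
        if e.sequence
            .compare_exchange(seq, seq + 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            e.owner.store(new_owner, Ordering::Relaxed); // step 3: modify fields
            e.version.fetch_add(1, Ordering::Relaxed);
            e.sequence.store(seq + 2, Ordering::Release); // step 4: publish
            return;
        }
    }
}

/// Reader: retry until an even, unchanged sequence brackets the field reads.
fn read_entry(e: &Entry) -> (u64, u64) {
    loop {
        let s1 = e.sequence.load(Ordering::Acquire);
        if s1 % 2 != 0 {
            std::hint::spin_loop();
            continue; // write in progress
        }
        let snapshot = (e.owner.load(Ordering::Relaxed), e.version.load(Ordering::Relaxed));
        if e.sequence.load(Ordering::Acquire) == s1 {
            return snapshot; // consistent
        }
    }
}

fn main() {
    let e = Entry {
        sequence: AtomicU64::new(0),
        owner: AtomicU64::new(0),
        version: AtomicU64::new(0),
    };
    write_entry(&e, 5);
    write_entry(&e, 9);
    assert_eq!(read_entry(&e), (9, 2));
    assert_eq!(e.sequence.load(Ordering::Relaxed), 4); // two writes: 0 -> 2 -> 4
    println!("seqlock ok");
}
```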
47.5.3 Home Node Directory
Each shared page has a home node determined by hashing its virtual address. The home node stores the authoritative directory entry (who owns the page, who has copies). This avoids a centralized directory server.
Directory indexing data structure: The home node stores directory entries in a
per-region radix tree indexed by virtual page number within the region. The radix tree
uses 9-bit fan-out (512 entries per node, matching page table structure) with
RCU-protected reads and per-node spinlocks for modification. For a 1TB region with
4KB pages (268M entries), the radix tree uses approximately 4 levels with ~524K internal
nodes (~32MB of metadata). Mutual exclusion for directory entry writes is provided by
the CAS-based seqlock writer protocol defined on DsmDirectoryEntry::sequence: the
CAS(even → odd) step provides spinlock semantics (only one writer can succeed), and the
even→odd→even transitions provide seqlock semantics for readers. This eliminates the
need for a separate per-entry spinlock — the sequence field's CAS-based writer
protocol inherently serializes writers (see the writer-serialization note on
DsmDirectoryEntry in Section 47.5.2).
Page with virtual address VA:
home_node = hash(dsm_region_id, VA) % cluster_size
The home node might not be the owner or have a copy.
It just maintains the directory entry.
Why hash-based:
- No single point of failure (every node is home for some pages)
- O(1) lookup (no traversal)
- Uniform distribution across nodes (modular hashing)
- If home node fails: rehash to backup (Section 47.11)
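A minimal sketch of the home-node computation, assuming `DefaultHasher` as a stand-in for whatever hash function ISLE actually uses (this section does not specify one):

```rust
// Minimal sketch of home_node = hash(dsm_region_id, VA) % cluster_size.
// DefaultHasher is an assumption, not the ISLE hash function.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn home_node(dsm_region_id: u64, va: u64, cluster_size: u64) -> u64 {
    let mut h = DefaultHasher::new();
    // Hash the 4KB page number, not the byte offset, so all addresses in a
    // page agree on the same home node.
    (dsm_region_id, va >> 12).hash(&mut h);
    h.finish() % cluster_size // modular hashing, NOT consistent hashing
}

fn main() {
    // All byte offsets within one page map to the same home node.
    assert_eq!(home_node(1, 0x1000, 8), home_node(1, 0x1fff, 8));
    // The result is always a valid node index.
    assert!(home_node(1, 0x1000, 8) < 8);
    println!("home node for region 1, page 0x1000: {}", home_node(1, 0x1000, 8));
}
```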
Note: This is modular hashing (hash % cluster_size), NOT consistent hashing.
Modular hashing remaps most entries when cluster_size changes. ISLE targets
fixed-membership clusters where node join/leave is a rare, coordinated event.
When membership changes, directory rehash uses incremental migration:
1. Quorum leader announces new cluster_size to all nodes.
2. Each home node computes which of its directory entries map to a
different home node under the new hash. These entries are marked
"migrating" but remain readable at the old location.
3. New entries are placed in the new hash location immediately.
4. Existing entries are migrated lazily on access: when a node receives
a directory lookup for an entry it no longer owns (under the new hash),
it forwards the request to the new home node. If the new home node
does not yet have the entry, the old home node transfers it on demand.
5. A background sweep migrates remaining entries at low priority.
6. The directory maintains both old and new hash functions during the
migration window. No stop-the-world pause is required. The migration
duration depends on cluster size and DSM dataset size: for typical
clusters (8-32 nodes, <100GB DSM), the background sweep completes in
~1-5 seconds. For large DSM datasets (e.g., 1TB, ~256M directory entries),
migration of all entries takes longer — potentially minutes if entries are
not accessed during the sweep (each migration requires an RDMA round-trip
of ~3μs). The system remains fully functional during this window via the
dual-hash lookup (old location serves requests until migration completes).
7. After all entries are migrated, the old hash function is retired.
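The dual-hash lookup window (steps 4-6) can be sketched as follows, with `va % size` standing in for `hash(region, VA) % cluster_size` and a set of already-migrated entries standing in for the per-entry migration state. Names are illustrative.

```rust
// Sketch of the dual-hash lookup window during incremental directory rehash.
use std::collections::HashSet;

fn old_home(va: u64, old_cluster_size: u64) -> u64 {
    va % old_cluster_size // stand-in for hash % old size
}

fn new_home(va: u64, new_cluster_size: u64) -> u64 {
    va % new_cluster_size // stand-in for hash % new size
}

/// During the migration window an entry is served at its old home until it
/// has actually moved (step 4); afterwards the old home redirects to the new.
fn serving_node(va: u64, old_size: u64, new_size: u64, migrated: &HashSet<u64>) -> u64 {
    if migrated.contains(&va) {
        new_home(va, new_size)
    } else {
        old_home(va, old_size)
    }
}

fn main() {
    let mut migrated = HashSet::new();
    // Before migration: served at the old home (10 % 4 = node 2).
    assert_eq!(serving_node(10, 4, 5, &migrated), 2);
    // After the entry migrates: served at the new home (10 % 5 = node 0).
    migrated.insert(10);
    assert_eq!(serving_node(10, 4, 5, &migrated), 0);
    println!("dual-hash lookup ok");
}
```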
In-flight page faults during rehash: A page fault that arrives at a node which
is no longer the home node (under the new hash) is handled by forwarding:
- The old home node checks if it still has the directory entry locally.
If yes, it services the request directly (the entry hasn't migrated yet).
- If the entry has already migrated, the old home node returns a REDIRECT
response with the new home node ID. The faulting node retries the lookup
at the new home node. At most one redirect per fault in the common case
(the new hash is deterministic, so the second lookup always reaches the
correct node). During active entry migration, the fault handler may spin
for up to 8 retries before the redirect (see write-fault race below),
bounding worst-case latency to ~8μs + one redirect.
- Ownership transfers in progress (Section 47.5.4 steps 3-8) that span the
rehash boundary complete under the old hash. The entry is migrated to the
new home node after the transfer completes. This is safe because the
old home node holds the entry until migration, and the seqlock prevents
concurrent modification.
- **Concurrent rehash**: If a second membership change occurs while a rehash is
in progress, the first rehash is completed before the second begins (the
**quorum leader** — defined as the lowest-ID node in the current majority
partition — serializes membership change announcements). During the first
rehash's 1-5 second migration window, the quorum leader holds a membership-change lock
that prevents new node join/leave from being processed. New membership events
are queued and processed sequentially after the current rehash completes. This
ensures that at most one hash transition is active at any time, preserving the
"at most one redirect" guarantee.
**Failure during rehash**: If a node FAILS during a rehash, the failure
is handled by the existing membership protocol (Section 47.11) — the dead
node's directory entries are redistributed as part of the rehash, not as a
separate event. If the **quorum leader** dies while holding the membership-change
lock, the new quorum leader is deterministically identified as the lowest-ID
surviving node in the majority partition (Section 47.11.3). The new
leader inherits the in-progress rehash by querying all surviving nodes for their
current hash function version (old vs. new). It then either completes the rehash
(if >50% of entries have migrated) or aborts it (rolling back to the old hash
function). The membership-change lock is not a distributed lock — it is a logical
role held by the quorum leader, so quorum-leader reassignment implicitly transfers it.
Write-fault race during directory entry migration: A write fault on a page
whose directory entry is currently being moved to a new home node (step 4
above — lazy on-demand transfer from old to new home node) requires care:
- Each directory entry carries a per-entry rehash_epoch field (u32) that
is incremented when the entry is tagged "migrating to new home" (step 2).
- The fault handler reads the rehash_epoch before and after reading the
directory entry (as part of the existing seqlock protocol). If the epoch
has changed, or if the entry's state is "migrating", the fault handler
treats this as a transient condition and retries the directory lookup after
a short spin (up to 8 retries with 1 μs backoff, then falls back to the
REDIRECT path).
- Alternatively, the home node may hold a per-entry spinlock during the
entry-transfer step (old home → new home). The fault handler acquiring the
same lock before reading the entry ensures it either sees the entry fully
present (pre-transfer) or receives a REDIRECT (post-transfer), with no
window where the entry is partially visible.
Both mechanisms are equivalent in correctness; the epoch/retry approach is
preferred because it avoids blocking the fault path on a lock acquisition.
This is fundamentally different from DLM's consistent hashing (Section 31a.4
in 07-storage.md), which uses a hash ring for minimal redistribution because
lock master reassignment must be fast and non-disruptive.
47.5.4 Page Fault Flow
Process on Node A reads address VA (not present locally):
1. CPU page fault on Node A.
2. ISLE MM handler identifies VA as part of a DSM region.
3. Compute home_node = hash(region, VA) % cluster_size = Node C.
4. RDMA Send to home Node C: "Lookup directory entry for VA."
(Two-sided — the home node's CPU reads the entry locally using the seqlock
and returns it. One-sided RDMA Read cannot be used because the DsmDirectoryEntry
is larger than 8 bytes and a seqlock requires atomic re-reads, which separate
RDMA Read operations cannot provide. Round-trip: ~4-5 μs.)
Node C's request handler reads the directory entry under the seqlock. If the
entry's state is a transient state, the handler resolves it locally before
responding:
- **state == Invalidating**: A concurrent write-fault is invalidating readers.
The handler **blocks on a per-entry wait queue** (see "Invalidating state
blocking mechanism" below) until the state transitions out of Invalidating
(typically to Exclusive, once all invalidation acks arrive). The requesting
node (A) never sees the Invalidating state — the home node resolves it before
replying. The handler does NOT spin-wait; spinning for up to 1000ms (the
membership dead timeout) would block the CPU and could cause priority inversion
or deadlock in the request handler thread pool.
- **state == Migrating**: The page's home assignment is being transferred to a
different node due to a membership change. The handler returns EAGAIN. Node A
retries the directory lookup after a brief backoff (10-50μs), by which time
the migration has typically completed and the new home node is authoritative.
5. Directory says: owner = Node B, state = Exclusive (or SharedOwner).
6. RDMA Send to Node B: "Request read copy of page at VA."
(Two-sided, Node B's kernel handles the request.)
7. Node B's handler:
a. Transitions page state from Exclusive → SharedOwner.
b. RDMA Write: sends 4KB page to Node A.
c. RDMA Send: Node B sends directory update request to Node C (add Node A
to readers). Node C's CPU processes the request locally — updating the
`readers` bitfield, `state`, and `version` fields atomically using the
local seqlock. (~4-5 μs round-trip.)
8. Node A receives page:
a. Installs page in local page table (read-only mapping).
b. Sets local DsmPageState = SharedReader.
c. Resumes faulting process.
Total latency: ~10-18 μs (directory lookup ~4-5 μs + ownership request ~3-5 μs
+ page transfer ~3-5 μs + local install ~1 μs)
Compare: NVMe page fault = ~12-15 μs (comparable)
Process on Node A writes to address VA (has read-only copy):
1. CPU page fault (write to read-only page) on Node A.
2. ISLE MM handler identifies DSM page with SharedReader state.
3. Request ownership transfer:
a. RDMA Send to home Node C: "Request exclusive ownership of VA."
b. Node C looks up directory: owner = Node B, readers = {A, D}.
Node C transitions the directory entry to `Invalidating` state (setting
`owner = Node A` (the requester), `state = DsmPageState::Invalidating`)
under the seqlock before sending any invalidations. This ensures that any
concurrent directory lookup during the invalidation window sees the
`Invalidating` state and waits (see read-fault flow step 4).
c. Node C sends invalidation to all readers except requester:
- RDMA Send to Node D: "Invalidate your copy of VA."
- Node D unmaps page, flushes TLB, sends ack to Node C.
d. After all sharer acks received:
**Case 1: home != owner (C != B) — normal forwarding:**
- Node C sends ownership transfer request to Node B:
"Transfer ownership of VA to Node A."
- Node B unmaps page, sends page data directly to Node A via RDMA Write.
- Node B sends ack to Node C confirming transfer complete.
- Node B transitions to NotPresent.
**Case 2: home == owner (C == B) — skip forwarding:**
- The home node IS the current owner. No forwarding message is needed.
The home node directly invalidates its own local copy, prepares the
page data, and sends it to Node A via RDMA Write. This avoids
self-messaging (which would deadlock or cause infinite forwarding
if the home node's request handler sends a message to itself).
- Node C (the home/owner) unmaps its own page, transitions to NotPresent.
- **Concurrency model (applies to both Case 1 and Case 2)**: The directory
entry is transitioned to `Invalidating` at step 3b (before any invalidation
messages are sent), using the CAS-based seqlock writer protocol for mutual
exclusion. Other request handlers that read the directory entry during the
invalidation window see the `Invalidating` state and **block on the
per-entry wait queue** (see "Invalidating state blocking mechanism" below)
until the state transitions to `Exclusive` (indicating all invalidation
acks have been received and the transfer is complete). This prevents
stale-ownership reads: no node can observe the old owner in the directory
entry while invalidations are in flight. Invalidation requests to readers
are sent after the seqlock write section completes (sequence returns to
even), allowing concurrent directory reads to see the `Invalidating` state.
After all acks arrive, the handler re-acquires the seqlock, transitions
the state from `Invalidating` to `Exclusive` at step 3f, and **wakes all
waiters** on the per-entry wait queue. This prevents deadlock because
other directory operations on the same home node can proceed while waiting
for invalidation acks (waiters are descheduled, not spinning). Additionally,
invalidation ack processing on a reader node does NOT require any directory
operation — it only involves local page table manipulation and an RDMA
Send reply. Therefore, invalidation acks cannot create circular lock
dependencies.
e. Node A confirms receipt of page data by sending ack to Node C.
f. Node C updates directory: owner = Node A, readers = {}, state = Exclusive.
g. Node C sends grant to Node A (RDMA Send): "You now own VA exclusively."
4. Node A receives exclusive ownership:
a. Installs page with read-write mapping.
b. Resumes faulting process.
Total latency: ~15-25 μs (involves invalidating remote copies)
Home==owner optimization saves one RDMA round-trip (~3-5 μs) in that case.
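The page-state transitions exercised by the two fault flows above can be captured as a small legality check. This is an illustrative subset of the protocol, not an exhaustive transition table; `is_legal_transition` is a hypothetical helper.

```rust
// Illustrative legality check for the state transitions named in the
// read-fault and write-fault flows above.

#[derive(Clone, Copy, PartialEq, Debug)]
enum DsmPageState {
    Exclusive,
    SharedOwner,
    SharedReader,
    NotPresent,
    Migrating,
    Invalidating,
}

fn is_legal_transition(from: DsmPageState, to: DsmPageState) -> bool {
    use DsmPageState::*;
    matches!(
        (from, to),
        (Exclusive, SharedOwner)          // read fault 7a: remote reader fetched a copy
            | (NotPresent, SharedReader)  // read fault 8b: installed read-only copy
            | (SharedReader, Exclusive)   // write fault 4: ownership acquired
            | (SharedReader, NotPresent)  // write fault 3c: copy invalidated
            | (SharedOwner, NotPresent)   // write fault 3d: ownership transferred away
            | (SharedOwner, Invalidating) // directory, 3b: invalidation window opens
            | (Invalidating, Exclusive)   // directory, 3f: all acks received
    )
}

fn main() {
    use DsmPageState::*;
    assert!(is_legal_transition(Exclusive, SharedOwner));
    assert!(is_legal_transition(Invalidating, Exclusive));
    assert!(!is_legal_transition(NotPresent, SharedOwner));
    println!("transition check ok");
}
```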
Invalidation ack timeout with escalation: Each invalidation request sent by the home node in step 3c carries a 200 μs timeout. The home node does NOT proceed with the ownership transfer until all readers have acknowledged invalidation or been removed from the reader set — doing so would violate coherence (the stale reader could read data that the new exclusive owner has since modified). If a reader does not acknowledge within 200 μs, the home node escalates:
- Retry (up to 3 attempts, 200 μs apart): The home node re-sends the invalidation request. A live but slow reader (e.g., handling a long interrupt or scheduling delay) will respond on retry.
- Suspect (after 600 μs total): The home node reports the non-responding reader to the membership protocol (Section 47.11) as suspect. If the node is genuinely unreachable, the membership protocol will mark it Suspect after 3 missed heartbeats (300 ms) and Dead after 10 missed heartbeats (1000 ms, per Section 47.11.2), at which point its reader bit is cleared from all directory entries.
- Proceed after removal: Once the reader has either acknowledged the invalidation or been marked Dead by the membership protocol (at which point its reader bit is cleared from the directory entry), the ownership transfer proceeds. The write-faulting thread blocks during escalation but is guaranteed forward progress — either the reader responds or the membership protocol eventually marks it Dead and removes it.
This ensures strict coherence: no node can hold a read-only mapping while another node holds exclusive ownership. The worst-case latency for a write fault with an unresponsive reader is ~1000 ms (the membership dead timeout: 10 missed heartbeats at 100 ms intervals, per Section 47.11.2), compared to Linux's 10-30 second fencing delay. In the common case (all readers responsive), the additional cost is zero — the 200 μs timeout never fires.
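The escalation ladder above can be sketched as a simple retry loop. This is an illustrative, synchronous model (the names escalate_invalidation and AckOutcome, and the callback shape, are hypothetical); the real path would be asynchronous and integrated with the membership protocol rather than blocking in a loop.

```rust
// Illustrative model of the invalidation-ack escalation ladder.
// All names are hypothetical; the kernel path is event-driven.
#[derive(Debug, PartialEq)]
pub enum AckOutcome {
    /// Reader acknowledged on attempt `attempts` (1-based).
    Acked { attempts: u32 },
    /// Reader never responded; it was reported to the membership protocol,
    /// eventually marked Dead (~1000 ms), and its reader bit was cleared.
    ReaderRemoved,
}

pub const INVAL_TIMEOUT_US: u64 = 200;
pub const MAX_RETRIES: u32 = 3;

/// `send_and_wait(attempt)` models one invalidation send plus a 200 us wait;
/// it returns true if the reader acked within the timeout window.
pub fn escalate_invalidation<F>(mut send_and_wait: F) -> AckOutcome
where
    F: FnMut(u32) -> bool,
{
    // Up to 3 attempts, 200 us apart (600 us total before escalation).
    for attempt in 1..=MAX_RETRIES {
        if send_and_wait(attempt) {
            return AckOutcome::Acked { attempts: attempt };
        }
    }
    // After 600 us: report the reader as suspect to the membership protocol.
    // The transfer proceeds only once the reader is marked Dead and removed
    // from the sharer set.
    AckOutcome::ReaderRemoved
}
```

A reader that responds only on the second attempt yields Acked { attempts: 2 }; one that never responds falls through to the membership path.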
Maximum Invalidating state duration: A directory entry remains in the Invalidating
state from step 3b (when the home node sets it) until step 3f (when all invalidation
acks arrive and the state transitions to Exclusive). The maximum duration is bounded
by the invalidation ack timeout escalation above: 200 μs initial timeout, up to 3
retries (600 μs), then escalation to the membership protocol which marks an
unresponsive node Dead after 1000 ms (10 missed heartbeats at 100 ms intervals, per
Section 47.11.2). At that point, the home node removes the unresponsive node from the
sharer set and proceeds with the state transition. Therefore, the worst-case
Invalidating duration is bounded by the membership dead timeout (~1000 ms). Any
concurrent read-fault that arrives at the home node during this window will block
on the per-entry wait queue (see read-fault flow step 4) until the state resolves.
The waiters are descheduled during this time, not spinning, allowing the CPU to
process other requests.
Invalidating state blocking mechanism: Each DsmDirectoryEntry includes a
per-entry spinlock (entry_lock: SpinLock) and a kernel wait queue
(wait_queue: WaitQueueHead) for blocking on state transitions. The seqlock
(sequence field) remains the fast-path mechanism for read-side directory lookups,
but the blocking/wakeup path uses a separate spinlock because seqlocks cannot be
held across schedule() (the seqlock write side disables preemption internally,
and sleeping with preemption disabled is a deadlock).
When a request handler encounters the Invalidating state:
1. The handler acquires entry_lock (the per-entry spinlock).
2. The handler re-checks the state under the spinlock. If it is no longer Invalidating, the handler releases the spinlock and proceeds (avoiding an unnecessary sleep).
3. The handler calls prepare_to_wait(&entry.wait_queue, TASK_INTERRUPTIBLE) — this adds the handler to the wait queue while holding the spinlock, preventing the race where a waker calls wake_up_all() between the lock release and the wait-queue addition.
4. The handler releases entry_lock.
5. The handler calls schedule(), which deschedules the thread. If a concurrent wakeup occurred between steps 3 and 5, schedule() returns immediately without sleeping (the standard Linux wait pattern).
6. On wakeup, the handler calls finish_wait(&entry.wait_queue) and re-reads the directory entry under the seqlock (optimistic read path) to proceed.
When the ownership transfer completes (step 3f), the invalidating handler:
- Acquires the seqlock writer (CAS sequence even→odd, serializing writers).
- Transitions state from Invalidating to Exclusive.
- Releases the seqlock writer (writes the sequence back to even).
- Acquires entry_lock, calls wake_up_all(&entry.wait_queue), releases entry_lock.
Wait queue safety invariant: The wait queue is protected by entry_lock (a
per-entry spinlock), NOT by the seqlock. Waiters add themselves to the queue under
entry_lock (step 3), and wakers hold entry_lock while calling wake_up_all().
The seqlock writer and entry_lock are independent — the seqlock serializes
directory entry updates (state transitions), while entry_lock serializes wait
queue operations. Lock ordering: seqlock writer THEN entry_lock (never reversed).
This blocking mechanism ensures that the home node can process thousands of
concurrent DSM requests without CPU-starving due to spin-waits. The wait queue
is per-entry (not global), so blocking on one page's invalidation does not
affect other pages. The wait queue head is allocated lazily — the
DsmDirectoryEntry stores only a pointer (wait_queue: *mut WaitQueueHead,
8 bytes) that is null until the first blocking wait occurs. Most entries never
experience contention, so the common case pays only 8 bytes of pointer overhead
per entry rather than embedding a full 16-32 byte wait queue head in every entry.
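The wait/wake protocol above can be modeled in userspace with a mutex and condition variable standing in for entry_lock and the wait queue. This is a sketch of the synchronization shape only (Condvar::wait atomically releases the lock, playing the role of the prepare_to_wait + schedule pairing, so there is no lost-wakeup window); all names are illustrative.

```rust
use std::sync::{Condvar, Mutex};

// Userspace model of the per-entry blocking protocol: the Mutex stands in
// for entry_lock and the Condvar for the wait queue. Illustrative only.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum DirState {
    Invalidating,
    Exclusive,
}

pub struct EntryModel {
    state: Mutex<DirState>,
    wait_queue: Condvar,
}

impl EntryModel {
    pub fn new(state: DirState) -> Self {
        EntryModel { state: Mutex::new(state), wait_queue: Condvar::new() }
    }

    /// Handler side: re-check under the lock, then sleep until the state
    /// leaves Invalidating. Condvar::wait releases the lock atomically,
    /// mirroring prepare_to_wait + schedule; the loop handles spurious wakeups.
    pub fn wait_until_resolved(&self) -> DirState {
        let mut st = self.state.lock().unwrap();
        while *st == DirState::Invalidating {
            st = self.wait_queue.wait(st).unwrap();
        }
        *st
    }

    /// Invalidating-handler side: transition the state, then wake all waiters
    /// (mirrors the seqlock write followed by wake_up_all under entry_lock).
    pub fn resolve(&self, new_state: DirState) {
        *self.state.lock().unwrap() = new_state;
        self.wait_queue.notify_all();
    }
}
```

If resolve() runs before a waiter blocks, the waiter's re-check under the lock observes Exclusive and returns without sleeping, matching step 2 of the protocol.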
RDMA operation ordering: Multi-step DSM operations (send page data, then update
directory) require explicit acknowledgment to guarantee ordering across different QPs.
IBV_SEND_FENCE only orders operations within a single QP, but page data goes to the
page recipient while the directory update goes to the home node — these are different
QPs on different remote nodes. The correct protocol uses an explicit ack step:
1. Owner sends page data to requester via RDMA Write (on QP to requester).
2. Requester confirms receipt by sending an RDMA Send ack to the home node.
3. Home node updates the directory via local CAS (no cross-QP ordering needed since
the home node's CPU processes the ack before updating its own local directory).
This adds one extra round-trip (~2 μs) compared to a naive fence-based approach, but
is required for correctness because RDMA fencing cannot provide cross-QP ordering.
Security requirement — Invalidation ACK authentication: Invalidation ACKs
MUST be authenticated to prevent spoofing by malicious nodes. Without authentication,
a malicious node could send forged invalidation ACKs to the home node, causing it to
prematurely transition the directory entry to Exclusive while a stale reader still
holds a mapping — violating cache coherence and potentially leaking data. Authentication
requirements:
- Each invalidation request from the home node (step 3c) MUST include a cryptographically random 128-bit invalidation_nonce generated fresh for each invalidation batch.
- The invalidation ACK from each reader node MUST include:
  - The same invalidation_nonce (proving the ACK is a response to this specific invalidation request, not a replay)
  - An HMAC-SHA-256 computed over {nonce || page_va || reader_node_id} using a session key established during cluster join (Section 47.2.3, derived via X25519 Diffie-Hellman and HKDF-SHA256 from the shared secret)
- The home node MUST verify the HMAC before accepting the ACK. Invalid or missing HMACs MUST be treated as if the ACK was never received (triggering timeout and escalation to the membership protocol).
- To limit overhead, the session key is established once during cluster join and rotated every 24 hours. Key rotation uses the authenticated control channel (Section 47.2.3).
- ACKs received after the directory state has already transitioned to Exclusive (duplicate ACKs due to network reordering) MUST be silently discarded without error — the nonce lookup will fail since the invalidation batch is complete.
Performance impact: HMAC-SHA-256 verification adds ~500 ns to ACK processing on the home node. This is negligible compared to the ~200 μs timeout window. The nonce prevents replay attacks without requiring per-ACK signatures.
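The home node's ACK bookkeeping (nonce lookup, duplicate discard, invalid-MAC handling) can be sketched as follows. The MAC check is abstracted into a boolean here; a real implementation would verify the HMAC-SHA-256 described above before this point. All type and function names are hypothetical.

```rust
use std::collections::HashMap;

// Model of home-node invalidation-ACK processing. Illustrative names only;
// `mac_ok` stands in for HMAC-SHA-256 verification over
// {nonce || page_va || reader_node_id} with the per-session key.
#[derive(Debug, PartialEq)]
pub enum AckResult {
    /// Valid ACK: reader removed from the pending set.
    Accepted,
    /// Nonce unknown: batch already complete (duplicate ACK), silently discard.
    Discarded,
    /// Bad or missing MAC: treat as if never received (timeout will fire).
    Ignored,
}

/// Pending invalidation batches, keyed by the 128-bit nonce.
/// Value: the set of reader node ids still expected to ACK.
pub struct PendingInvalidations {
    batches: HashMap<u128, Vec<u16>>,
}

impl PendingInvalidations {
    pub fn new() -> Self {
        PendingInvalidations { batches: HashMap::new() }
    }

    pub fn start_batch(&mut self, nonce: u128, readers: Vec<u16>) {
        self.batches.insert(nonce, readers);
    }

    pub fn process_ack(&mut self, nonce: u128, reader: u16, mac_ok: bool) -> AckResult {
        if !mac_ok {
            return AckResult::Ignored;
        }
        let readers = match self.batches.get_mut(&nonce) {
            Some(r) => r,
            // Nonce lookup fails once the batch is complete: duplicate ACK.
            None => return AckResult::Discarded,
        };
        readers.retain(|&r| r != reader);
        if readers.is_empty() {
            // Last ACK: batch done; the state may transition to Exclusive.
            self.batches.remove(&nonce);
        }
        AckResult::Accepted
    }
}
```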
47.5.5 Extending PageLocationTracker
The PageLocation enum (Section 43.1.5) includes RemoteNode, RemoteDevice, and
CxlPool variants for distributed memory tracking. The following shows the distributed
variants and their semantics:
// Distributed variants of PageLocation (defined canonically in Section 43.1.5)
pub enum PageLocation {
// ... existing variants (CpuNode, DeviceLocal, Migrating, etc.) ...
/// Page is in CPU memory on this NUMA node (existing).
CpuNode(u8),
/// Page is in accelerator device-local memory (existing).
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is being transferred (migration in progress).
/// Consistent with DsmPageState::Migrating = 4 (the canonical definition above).
/// **Canonical definition**: Section 43.1.5 in 11-accelerators.md defines the
/// Migrating variant with a side-table index (`migration_id: u32`) to keep
/// `PageLocation` at 24 bytes. The `MigrationRecord` side table (defined in
/// Section 43.1.5, stored in `PageLocationTracker::active_migrations`) holds
/// the full source/target details (source_kind, source_node, source_device,
/// source_addr, target_kind, target_node, target_device, target_addr).
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (existing).
NotPresent,
/// Page is in compressed pool (existing).
Compressed,
/// Page is in swap (existing).
Swapped,
// === New: distributed memory locations ===
/// Page is on a remote node's CPU memory, accessible via RDMA.
RemoteNode {
node_id: NodeId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote node's accelerator memory (GPUDirect RDMA).
RemoteDevice {
node_id: NodeId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
Security requirement — CXL pool bounds validation: The kernel MUST validate
pool_id and pool_offset before using them to access memory. An out-of-bounds
pool_id or pool_offset could allow unauthorized access to memory outside the
intended pool, potentially exposing kernel data or allowing privilege escalation.
Validation requirements:
- pool_id MUST be validated against the global CXL pool registry (cxl_pool_count). Accessing a non-existent pool MUST fail with EINVAL.
- pool_offset MUST be validated against the target pool's size (cxl_pools[pool_id].size_bytes). Accesses beyond the pool boundary MUST fail with EFAULT.
- For RDMA-initiated CXL pool access, the remote node's capability (Section 47.9) MUST authorize the specific pool_id. A capability granting access to pool 0 MUST NOT be usable to access pool 1.
- CXL pool resize is grow-only — pools can be expanded but never shrunk. This eliminates the TOCTOU race where a pool shrinks between the bounds check and the CPU load/store instruction (seqlocks cannot prevent this race because CPU memory accesses are not rollback-capable). The pool's size_bytes is read with Acquire ordering; a concurrent grow only increases the valid range, so a stale (smaller) size produces a conservative bounds check, never an out-of-bounds access. Pool deallocation (destroying a pool entirely) requires quiescing all accessors first via the standard RCU grace period mechanism.
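A minimal sketch of the bounds checks, under the grow-only assumption stated above (types and the helper name are illustrative). Errors mirror the spec: EINVAL for a non-existent pool, EFAULT for an access beyond the pool boundary.

```rust
// Sketch of CXL pool bounds validation. Hypothetical types/names.
#[derive(Debug, PartialEq)]
pub enum CxlError {
    Einval, // non-existent pool_id
    Efault, // offset/length beyond the pool boundary
}

pub struct CxlPool {
    /// Grow-only: a stale (smaller) read of size_bytes is always conservative.
    pub size_bytes: u64,
}

pub fn validate_cxl_access(
    pools: &[CxlPool],
    pool_id: u32,
    pool_offset: u64,
    len: u64,
) -> Result<(), CxlError> {
    // pool_id must name an existing pool in the global registry.
    let pool = pools.get(pool_id as usize).ok_or(CxlError::Einval)?;
    // The access [pool_offset, pool_offset + len) must lie inside the pool;
    // checked_add rejects offsets where offset + len would wrap around u64.
    let end = pool_offset.checked_add(len).ok_or(CxlError::Efault)?;
    if end > pool.size_bytes {
        return Err(CxlError::Efault);
    }
    Ok(())
}
```

The overflow check matters: without checked_add, a crafted pool_offset near u64::MAX could wrap past the size comparison.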
47.5.6 DSM Region Management
Distributed shared memory is opt-in. Processes create DSM regions explicitly:
/// Create a distributed shared memory region.
/// All nodes participating in this region can map it.
pub struct DsmRegionCreate {
/// Unique region identifier (cluster-wide).
pub region_id: u64,
/// Virtual address base for this region (must be page-aligned).
/// This is the starting virtual address at which the region will be mapped
/// on all participating nodes. Must not overlap existing mappings.
pub base_addr: u64,
/// Size of the shared region (bytes, page-aligned).
pub size: u64,
/// Page size for this region.
pub page_size: DsmPageSize,
/// Access permissions for this region (read / write / execute bitfield).
/// DSM_PROT_READ = 0x1, DSM_PROT_WRITE = 0x2, DSM_PROT_EXEC = 0x4.
/// Applied uniformly to all mappings on all participating nodes.
pub permissions: u32,
/// Consistency model.
pub consistency: DsmConsistency,
/// Initial placement: which node holds the pages initially.
pub initial_owner: NodeId,
/// Home node assignment policy for directory entries.
/// DSM_HOME_HASH = 0: home_node = hash(region_id, VA) % cluster_size (default).
/// DSM_HOME_FIXED = 1: all directory entries homed on initial_owner (simple, but
/// creates a single hot-spot; use only for small, rarely-accessed regions).
pub home_policy: u32,
/// Access control: which nodes can join this region.
pub allowed_nodes: u64, // Bitfield, limited to MAX_CLUSTER_NODES (64)
/// Capability required to map this region.
pub required_cap: CapHandle,
/// Behavior flags (bitfield).
/// DSM_EAGER_WRITEBACK = 0x1: flush dirty pages to owner on lock release
/// (Section 47.5.8). Default: lazy writeback on eviction only.
/// DSM_REPLICATE = 0x2: enable directory replication for fault tolerance
/// (Section 47.11.4). Default: single home node per entry.
pub flags: u32,
/// Reserved for future extensions; must be zero.
pub _pad: [u8; 4],
}
#[repr(u32)]
pub enum DsmPageSize {
/// Standard 4KB pages. Best for random access patterns.
Page4K = 0,
/// 2MB huge pages. Best for sequential / bulk access.
/// Reduces TLB misses but increases transfer granularity.
HugePage2M = 1,
}
#[repr(u32)]
pub enum DsmConsistency {
/// Release consistency: writes become visible to other nodes
/// after an explicit release (memory barrier / unlock).
/// Best performance. Standard for most HPC/ML workloads.
Release = 0,
/// Sequential consistency is NOT supported. Providing a total order over all
/// memory operations across nodes would require serializing every write through
/// a single ordering point (or using a distributed total-order broadcast), adding
/// ~5-15 μs per write operation even on RDMA. This overhead would negate the
/// performance benefits of DSM for virtually all workloads. Applications requiring
/// sequential consistency should use explicit distributed locking (DLM, Section
/// 31a) or message-passing instead of DSM.
///
/// Reserved for potential future use if hardware (e.g., CXL 3.0 with hardware
/// coherence) provides efficient total ordering.
SequentialReserved = 1,
/// Eventual consistency: writes propagate asynchronously.
/// Lowest overhead. Suitable for read-heavy, stale-tolerant data.
Eventual = 2,
}
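A sketch of the parameter validation a region-create path might perform on the fields above. The constants follow the struct's documentation; the specific error strings and the helper itself are illustrative, not the kernel's actual interface.

```rust
// Illustrative validation of DsmRegionCreate parameters (field subset only).
pub const DSM_PROT_READ: u32 = 0x1;
pub const DSM_PROT_WRITE: u32 = 0x2;
pub const DSM_PROT_EXEC: u32 = 0x4;
pub const DSM_EAGER_WRITEBACK: u32 = 0x1;
pub const DSM_REPLICATE: u32 = 0x2;

pub fn validate_region_params(
    base_addr: u64,
    size: u64,
    page_size_bytes: u64, // 4096 for Page4K, 2 MiB for HugePage2M
    permissions: u32,
    flags: u32,
    pad: [u8; 4],
) -> Result<(), &'static str> {
    if size == 0 {
        return Err("size must be non-zero");
    }
    // base_addr and size must be aligned to the region's page size.
    if base_addr % page_size_bytes != 0 || size % page_size_bytes != 0 {
        return Err("base_addr/size not page-aligned");
    }
    // Only the documented permission bits may be set.
    if permissions & !(DSM_PROT_READ | DSM_PROT_WRITE | DSM_PROT_EXEC) != 0 {
        return Err("unknown permission bits");
    }
    // Only the documented behavior flags may be set.
    if flags & !(DSM_EAGER_WRITEBACK | DSM_REPLICATE) != 0 {
        return Err("unknown flag bits");
    }
    // Reserved padding must be zero (allows future extension).
    if pad != [0; 4] {
        return Err("_pad must be zero");
    }
    Ok(())
}
```

Rejecting unknown bits (rather than ignoring them) is what lets _pad and the flags field grow into new meanings later without silently misinterpreting old callers.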
47.5.6.1 DSM Region Destruction Protocol
Destroying a DSM region requires coordinated cleanup across all participating nodes:
DSM region destruction for region R:
1. Initiator (any node with the region's destroy capability) sends
REGION_DESTROY(region_id) to all nodes in allowed_nodes bitmask.
2. Each participating node:
a. Unmaps all local pages belonging to region R from process page tables.
b. Flushes TLB entries for region R's address range.
c. For pages where this node is the owner: marks pages as reclaimable
(returned to the local physical page allocator).
d. For pages where this node is a reader: discards the local copy
(no writeback needed — reader copies are clean by the single-writer
invariant).
e. Sends REGION_DESTROY_ACK(region_id, node_id) to the initiator.
3. Initiator collects ACKs from all participating nodes.
Timeout: 5 seconds. Nodes that do not ACK within timeout are assumed
dead — their pages are abandoned (will be reclaimed when those nodes
eventually rejoin or are declared dead via heartbeat).
4. After all ACKs received (or timeout):
a. Initiator sends REGION_DIRECTORY_CLEANUP(region_id) to all nodes
that serve as home nodes for pages in this region.
b. Each home node removes all DsmDirectoryEntry records for region R
from its directory. The backup home node (Section 47.11.3) is also
notified to remove shadow entries.
c. The region_id is retired and cannot be reused for 24 hours
(prevents stale references from delayed messages).
5. Region metadata is removed from /sys/kernel/isle/cluster/dsm/regions.
Error handling:
- If a process still has region R mapped when destruction is initiated,
the mapping is force-unmapped and the process receives SIGBUS on
subsequent access attempts.
- If the initiator crashes during destruction, any node can resume
the protocol by re-sending REGION_DESTROY (idempotent — nodes that
already completed destruction simply re-ACK).
47.5.7 Linux Compatibility Interface
DSM is exposed via standard POSIX shared memory with extensions:
Standard POSIX (works unmodified):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ftruncate(fd, size);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ On a single node, this is standard shared memory. No DSM.
ISLE-specific extension (opt-in):
fd = shm_open("/my_region", O_RDWR | O_CREAT, 0666);
ioctl(fd, ISLE_SHM_MAKE_DISTRIBUTED, &dsm_config);
ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
→ Same mmap'd pointer, but pages can now migrate across nodes.
→ Process on Node B can shm_open("/my_region") and map the same region.
For MPI applications: MPI implementations (OpenMPI, MPICH) can use DSM
regions for intra-communicator shared memory windows, replacing the
current combination of mmap + RDMA + application-level coherence with
kernel-managed coherence.
47.5.8 False Sharing Mitigation
When multiple nodes write to different offsets within the same page, the page bounces between owners ("false sharing"), degrading performance severely.
Detection: The home node tracks per-page ownership transfer frequency. If a page
has more than N ownership transfers per second (default: 100), it is flagged as
contended. The contention count is exposed via the isle_tp_stable_dsm_contention
tracepoint.
Mitigation 1: Advisory — Log contended pages to a tracepoint, allowing the application to restructure its data layout to avoid cross-node false sharing. This is the lowest-overhead option.
Mitigation 2: Whole-page writeback on release — On a write fault to a contended page under the single-writer model (Section 47.5.2), the owning node sends the entire 4KB page to the home node at the next release point (memory barrier or unlock). This is simpler than TreadMarks-style Twin/Diff because the single-writer invariant guarantees only one node modifies the page at a time — there are no concurrent writers whose changes need merging. The home node distributes the updated page to readers on their next fault. Overhead: one 4KB RDMA Write per release (~1-2 μs). This is acceptable because contention mitigation already implies the page is frequently transferred. Note: Twin/Diff (creating a copy before writing, then diffing to find changed bytes) would only be beneficial under a relaxed multiple-writer consistency model. ISLE's DSM uses single-writer, making whole-page transfer the correct and simpler approach.
Mitigation 3: Sub-page coherence for huge pages — For 2MB huge pages that exhibit contention, the kernel falls back to 4KB sub-page coherence for the contended region. The huge page is logically split into 512 sub-pages, each tracked independently in the DSM directory. The physical huge page is preserved (no TLB cost).
Default: Detection + advisory tracing. Eager writeback on release is opt-in per
DSM region via DsmRegionCreate.flags |= DSM_EAGER_WRITEBACK.
47.5.9 Error Handling
Concurrent ownership requests: When two nodes simultaneously request exclusive ownership of the same page, the home node serializes the requests. The first request is processed normally; the second requester is queued and notified when ownership becomes available.
Stale directory: If a node presents an ownership claim with a version counter that doesn't match the directory, the home node detects the stale state and rejects the operation. The requestor retries by re-fetching the directory entry.
Partial transfer: If a page is sent via RDMA Write but the directory update (RDMA Send to home node) fails (e.g., due to a concurrent modification or lost message), the home node detects the version mismatch on the next access and repairs the directory. The transferred page is either adopted (if the transfer was valid) or discarded. Directory updates use two-sided RDMA Send because the DsmDirectoryEntry is 56 bytes (core fields 48 bytes + 8-byte lazy wait queue pointer; too large for 8-byte RDMA Atomic CAS). The home node's CPU processes the update request locally using the seqlock protocol (Section 47.5.4).
Deadlock: If Node A waits for ownership of a page held by Node B, while Node B
waits for a page held by Node A, a deadlock occurs. DSM ownership deadlocks are
resolved via timeout on ownership requests (default: 10ms). On timeout, the requesting
operation is aborted and returns EAGAIN. The application retries with backoff. This
is a deadlock recovery mechanism (timeout-based), not true deadlock detection
(which requires a wait-for graph). True deadlock detection with a distributed wait-for
graph is implemented in the DLM (Section 31a.9 in 07-storage.md), where lock
dependencies are tracked explicitly. DSM uses the simpler timeout approach because page
ownership requests are transient (microsecond-scale) and building a wait-for graph for
every page fault would add unacceptable overhead to the critical path.
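The timeout-recovery path above (abort with EAGAIN, retry with backoff) can be sketched as follows. The exponential backoff with a cap is one plausible policy, not mandated by the spec; the callback and function names are hypothetical.

```rust
// Model of timeout-based deadlock recovery for DSM ownership requests.
// `request(attempt)` models one ownership request with its 10 ms timeout;
// it returns true on success, false if the request timed out (EAGAIN).
pub fn acquire_with_backoff<F>(
    mut request: F,
    max_retries: u32,
) -> Result<Vec<u64>, &'static str>
where
    F: FnMut(u32) -> bool,
{
    let mut slept_ms = Vec::new(); // backoff delays taken, for illustration
    let mut backoff_ms = 1;
    for attempt in 0..max_retries {
        if request(attempt) {
            return Ok(slept_ms);
        }
        // EAGAIN: back off before retrying (exponential, capped at 64 ms).
        // Randomized jitter would help avoid lockstep retries in practice.
        slept_ms.push(backoff_ms);
        backoff_ms = (backoff_ms * 2).min(64);
    }
    Err("ownership request failed after max retries")
}
```

Because both deadlocked parties time out and back off independently, one of them typically wins the retry race, which is why this recovery scheme suffices for microsecond-scale page ownership without a wait-for graph.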
47.5.10 Honest Performance Expectations
DSM provides a programming convenience (shared address space), NOT transparent performance parity with local memory. Expectations must be calibrated:
What DSM IS good for:
- Read-mostly workloads with occasional writes (replicated data).
- Coarse-grained partitioned workloads where each node mostly touches
its own partition, with infrequent cross-node access.
- Replacing explicit message-passing in applications where shared memory
is a natural fit and page-level granularity matches access patterns.
What DSM is NOT good for:
- Fine-grained shared data structures (concurrent hash maps, work-stealing
queues). These cause page-level false sharing and ownership bouncing.
Use explicit RDMA messaging or partitioned data structures instead.
- Workloads with random write patterns across the shared space.
Every cross-node write fault costs ~5-50μs (RDMA round-trip + TLB
invalidation), vs ~100ns for local memory. This is 50-500x slower.
- Latency-sensitive paths. DSM page faults are unpredictable.
Performance model (InfiniBand 200Gb/s, ~2μs RTT):
Local page access (TLB hit): ~1 ns
Local page fault (mmap, disk): ~1-100 μs
DSM read fault (page not present): ~10-18 μs (directory lookup ~4-5 μs
+ ownership negotiation ~3-5 μs
+ page transfer ~3-5 μs
+ local install ~1 μs;
see §47.5.4 for detailed flow)
DSM write fault (exclusive ownership): ~10-50 μs (invalidate readers + transfer)
DSM false sharing (bouncing page): ~100+ μs per iteration (pathological)
The mitigations in §47.5.8 help, but cannot eliminate the fundamental cost of network coherence. Applications must be DSM-aware at the data structure level. The kernel's job is to make the common case fast and the pathological case detectable (tracepoints), not to pretend the network is as fast as local memory.
47.5.11 Interaction with Memory Compression (Section 13)
The DSM protocol and the memory compression tier (Section 13) have a potential conflict: when a page is compressed in the local zpool, should it be advertised as "present" or "not present" to the DSM directory?
Design decision: Compressed pages are treated as locally present in the DSM protocol. The decompression cost (~1-3 microseconds via LZ4) is far lower than a network fetch (~5-50 microseconds over RDMA), so it never makes sense to fetch a remote copy of a page that exists locally in compressed form.
Interaction rules:
- Remote node requests a locally compressed page: The owning node decompresses from its zpool, then transfers the uncompressed data via RDMA Write. The zpool entry is freed after transfer.
- DSM migrates a page to a remote node: The sending node decompresses first, then sends the uncompressed 4KB page. The receiving node may independently compress it based on local memory pressure.
- Compressed page metadata is NOT replicated across nodes. Compression is a node-local optimization, invisible to the DSM directory and coherence protocol.
- DSM coherence protocol uses only the {CpuNode, RemoteNode, NotPresent, Migrating} variants of PageLocation (Section 43.1.5) for coherence decisions. A "Compressed" DSM state is NOT added — a compressed page is simply "local on CpuNode(N)" from the DSM protocol's perspective. The full PageLocationTracker tracks all variants (including DeviceLocal, Compressed, Swapped, RemoteDevice, CxlPool), but compression is transparent to the DSM directory.
Edge case — double fault path: If Node B requests a page from Node A, and Node A has that page compressed in its zpool: (1) Node A receives the RDMA request, (2) local MM subsystem decompresses from zpool (~1-2 microseconds), (3) DSM handler completes the RDMA Write with the decompressed data. The requesting node never knows the page was compressed. Total added latency: ~1-2 microseconds.
47.6 Global Memory Pool
47.6.1 Design: Cluster Memory as a Unified Tier Hierarchy
The memory manager already manages a tier hierarchy:
Current (single-node):
Tier 0: Per-CPU page caches (fastest, smallest)
Tier 1: Local NUMA node DRAM (fast, local)
Tier 2: Remote NUMA node DRAM (cross-socket, ~150ns)
Tier 3: GPU VRAM (Section 43)
Tier 4: Compressed pool (Section 13)
Tier 5: Swap (NVMe SSD)
Extended (distributed cluster):
Tier 0: Per-CPU page caches (unchanged)
Tier 1: Local NUMA node DRAM (unchanged)
Tier 2: Remote NUMA node DRAM, same machine (unchanged)
Tier 3: CXL-attached memory pool (~200-400ns) ← NEW
Tier 4: GPU VRAM (unchanged)
Tier 5: Compressed pool (unchanged)
Tier 6: Remote node DRAM via RDMA (~3-5 μs) ← NEW
Tier 7: Remote node compressed pool via RDMA ← NEW
Tier 8: Local NVMe swap (unchanged)
Tier 9: Remote NVMe via NVMe-oF/RDMA ← NEW
Key insight: Tier 6 (remote RDMA, ~3-5 μs) is faster than Tier 8 (local NVMe).
The global memory pool inserts remote memory as a tier BETWEEN
compressed pages and local swap. Compressed pool (Tier 5, ~1-2 μs
decompression) precedes remote RDMA because local decompression is
faster than a network round-trip.
Note: Tier numbers are ordinal positions in the latency-sorted hierarchy, not
fixed identifiers. When distributed mode adds new tier sources (CXL-attached memory,
remote node DRAM), existing sources shift to higher tier numbers. Code that uses
tier numbers MUST NOT hardcode numeric values — instead, use the TierKind enum
(defined in Section 12.7) to identify tier types by semantic name (e.g.,
TierKind::LocalDram, TierKind::GpuVram, TierKind::DsmRemote) and query
mem::tier_ordinal(kind) for the current ordinal position of each kind.
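The ordinal-vs-kind distinction can be sketched as a lookup over the current latency-sorted order. The TierKind variants and the table below are illustrative (Section 12.7 holds the canonical definitions); the point is that code names tiers semantically and asks for their current position.

```rust
// Illustrative TierKind enum and ordinal lookup; canonical definitions
// live in Section 12.7. Variant names beyond those cited in the text
// (LocalDram, GpuVram, DsmRemote) are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TierKind {
    PerCpuCache,
    LocalDram,
    RemoteNumaDram,
    CxlPool,
    GpuVram,
    CompressedPool,
    DsmRemote,
    DsmRemoteCompressed,
    LocalSwap,
    RemoteNvmeOf,
}

/// The distributed-mode latency-sorted order from the hierarchy above.
pub const DISTRIBUTED_ORDER: [TierKind; 10] = [
    TierKind::PerCpuCache,         // Tier 0
    TierKind::LocalDram,           // Tier 1
    TierKind::RemoteNumaDram,      // Tier 2
    TierKind::CxlPool,             // Tier 3 (new)
    TierKind::GpuVram,             // Tier 4
    TierKind::CompressedPool,      // Tier 5
    TierKind::DsmRemote,           // Tier 6 (new)
    TierKind::DsmRemoteCompressed, // Tier 7 (new)
    TierKind::LocalSwap,           // Tier 8
    TierKind::RemoteNvmeOf,        // Tier 9 (new)
];

/// Current ordinal position of a tier kind in the active order.
pub fn tier_ordinal(order: &[TierKind], kind: TierKind) -> Option<usize> {
    order.iter().position(|&k| k == kind)
}
```

Code written this way survives tier insertions: when distributed mode shifts LocalSwap from tier 5 to tier 8, callers comparing tier_ordinal values still get the right answer.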
47.6.2 Memory Pool Accounting
// isle-core/src/mem/global_pool.rs
/// Global memory pool state (cluster-wide view, maintained per-node).
pub struct GlobalMemoryPool {
/// Per-node memory availability.
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 47.2.4) to avoid heap allocation.
nodes: [NodeMemoryState; MAX_CLUSTER_NODES],
/// Number of active nodes in the cluster.
node_count: u32,
/// Total cluster memory (sum of all nodes).
total_cluster_bytes: u64,
/// Total available for remote allocation (sum of exported pools).
total_available_bytes: u64,
/// Current remote memory usage by local processes.
local_remote_usage_bytes: AtomicU64,
/// Current memory exported to remote nodes.
exported_usage_bytes: AtomicU64,
/// Policy for remote memory allocation.
policy: GlobalPoolPolicy,
}
pub struct NodeMemoryState {
node_id: NodeId,
/// Total physical memory on this node.
total_bytes: u64,
/// Memory available for remote allocation.
/// Admin-configurable: don't export all memory.
/// Default: 25% of total (local workloads get priority).
/// This pool is backed by the RDMA-registered memory region (Section 47.3.3).
/// Only RDMA-registered pages can be served to remote nodes.
export_pool_bytes: u64,
/// Currently allocated to remote nodes.
export_used_bytes: u64,
/// Distance to this node (from cluster distance matrix).
distance_ns: u32,
/// Bandwidth to this node (bytes/sec).
bandwidth_bytes_per_sec: u64,
}
pub struct GlobalPoolPolicy {
/// Maximum percentage of local memory to export for remote use.
/// Default: 25%. Protects local workloads from starvation.
pub max_export_percent: u32,
/// When local memory pressure exceeds this threshold,
/// start reclaiming exported pages (evicting remote users).
/// Default: 80% of local memory usage.
pub reclaim_threshold_percent: u32,
/// Prefer remote memory over local swap?
/// Default: true (RDMA is faster than NVMe).
pub prefer_remote_over_swap: bool,
/// Maximum remote memory a single process can consume (bytes).
/// 0 = unlimited (subject to cgroup limits).
pub per_process_remote_max: u64,
}
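The export-side admission decision implied by these fields can be sketched as a pure function: a grant must fit inside the configured export pool, and no new exports are granted once local pressure crosses the reclaim threshold. The helper and its exact checks are illustrative; the field names follow the structs above.

```rust
// Illustrative export admission check combining NodeMemoryState and
// GlobalPoolPolicy semantics. The helper itself is hypothetical.
pub struct ExportState {
    pub total_bytes: u64,
    pub export_pool_bytes: u64,
    pub export_used_bytes: u64,
    pub local_used_bytes: u64,
}

/// May this node grant `req_bytes` of its memory to a remote allocator?
pub fn can_export(
    s: &ExportState,
    req_bytes: u64,
    max_export_percent: u64,      // default 25
    reclaim_threshold_percent: u64, // default 80
) -> bool {
    // Never exceed the configured export pool (capped at max_export_percent
    // of total memory, protecting local workloads from starvation).
    let export_cap = (s.total_bytes * max_export_percent) / 100;
    let pool_limit = s.export_pool_bytes.min(export_cap);
    if s.export_used_bytes + req_bytes > pool_limit {
        return false;
    }
    // Under local pressure, stop granting new exports; the reclaim path
    // additionally starts evicting existing remote users.
    let local_pressure = (s.local_used_bytes * 100) / s.total_bytes;
    if local_pressure >= reclaim_threshold_percent {
        return false;
    }
    true
}
```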
Security requirement — Cross-node capability chain validation: Remote memory access via the global memory pool MUST validate capability chains across nodes. A capability issued by Node A that authorizes access to memory on Node A must NOT be usable to access memory on Node B without explicit cross-node authorization. Validation requirements:
1. Capability scope validation: When Node A's process accesses memory exported by Node B via the global pool, the kernel MUST verify that:
   - The process holds a valid DistributedCapability (Section 47.9.2) for remote memory access, signed by a trusted issuer
   - The capability's constraints field explicitly authorizes the target NodeId (or contains NODE_ID_ANY for cluster-wide access)
   - The capability has not expired, been revoked, or had its generation invalidated (standard distributed capability validation)
   - The capability's permissions field includes REMOTE_MEMORY_READ and/or REMOTE_MEMORY_WRITE as appropriate for the operation
2. Delegation chain integrity: If a capability was derived through delegation (e.g., process P1 on Node A delegates to process P2 on Node B), the kernel MUST verify the entire chain:
   - Each capability in the chain MUST have a valid signature from its issuer
   - Each delegation MUST have DELEGATE permission in the parent capability
   - Derived capabilities MUST be strictly less powerful than their parent (no permission amplification)
   - The delegation depth MUST NOT exceed MAX_CAP_DELEGATION_DEPTH (default: 8) to prevent resource exhaustion
3. Remote node attestation: Before honoring a cross-node capability, Node B MUST verify that Node A is a current, authenticated cluster member (not revoked, evicted, or marked Dead per Section 47.11). This check is O(1) via the cluster membership bitmap.
4. Audit logging: Cross-node capability validations that fail MUST be logged to the security audit subsystem (Section 17) with:
   - Source node ID
   - Target node ID
   - Capability object_id and generation
   - Failure reason (signature invalid, expired, revoked, wrong node, etc.)
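The delegation-chain rules can be sketched as a walk over a root-first chain. Signature verification is stubbed as a per-link boolean, and "strictly less powerful" is modeled as "no permission bit absent from the parent"; the types and the PERM_DELEGATE bit value are illustrative.

```rust
// Illustrative delegation-chain validation. Hypothetical types; the real
// DistributedCapability (Section 47.9.2) carries signatures, generations,
// and node constraints beyond this sketch.
pub const MAX_CAP_DELEGATION_DEPTH: usize = 8;
pub const PERM_DELEGATE: u32 = 0x8000; // bit value is an assumption

pub struct CapLink {
    pub permissions: u32,
    pub signature_valid: bool, // stands in for issuer-signature verification
}

/// Validate a chain ordered root-first.
pub fn validate_chain(chain: &[CapLink]) -> Result<(), &'static str> {
    if chain.is_empty() {
        return Err("empty chain");
    }
    if chain.len() > MAX_CAP_DELEGATION_DEPTH {
        return Err("delegation depth exceeded");
    }
    // Every link must carry a valid signature from its issuer.
    for link in chain {
        if !link.signature_valid {
            return Err("invalid signature");
        }
    }
    // Each parent/child pair: parent must hold DELEGATE, and the child
    // may not hold any permission bit the parent lacks (no amplification).
    for pair in chain.windows(2) {
        let (parent, child) = (&pair[0], &pair[1]);
        if parent.permissions & PERM_DELEGATE == 0 {
            return Err("parent lacks DELEGATE");
        }
        if child.permissions & !parent.permissions != 0 {
            return Err("permission amplification");
        }
    }
    Ok(())
}
```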
47.6.3 The Killer Use Case: AI Model Memory
Cluster: 4 nodes, each 512GB RAM + 4× A100 GPUs (80GB VRAM each)
Total CPU RAM: 2 TB
Total GPU VRAM: 1.28 TB
Total cluster memory: 3.28 TB
Scenario: Run a 405B parameter model (Llama-3.1-405B, ~810GB at FP16)
Without global memory pool (current state of the art):
- Tensor parallelism across 16 GPUs: each GPU holds 1/16 of model (~50GB)
- Model fits in GPU VRAM (80GB per GPU)
- BUT: KV cache for long context (128K tokens) = ~100-200GB additional
- KV cache spills to CPU RAM via UVM → 10-15 μs per page fault to NVMe
- Inference latency: dominated by KV cache spills
With global memory pool:
- GPU VRAM: hot layers (active attention heads, current KV cache entries)
- Local CPU RAM: warm layers (recent KV cache, inactive attention heads)
- Remote CPU RAM (RDMA): cold KV cache entries from other nodes
→ 3-5 μs per page fault (faster than local NVMe!)
- Local NVMe: only for truly cold data (old checkpoints, etc.)
The kernel manages placement transparently:
- MigrationPolicy tracks access patterns per page
- Hot pages migrate toward the accessing GPU/CPU
- Cold pages migrate to remote nodes with more available memory
- The ML framework sees a flat address space; kernel handles the rest
Performance impact:
- KV cache "miss" on remote RDMA: ~5 μs (vs. ~15 μs from NVMe)
- 3x improvement in tail latency for long-context inference
- No application code changes required
47.6.4 Cgroup Integration
/sys/fs/cgroup/<group>/memory.remote.max
# Maximum remote memory this cgroup can consume (bytes)
# Default: "max" (unlimited, subject to global pool policy)
/sys/fs/cgroup/<group>/memory.remote.current
# Current remote memory usage (read-only)
/sys/fs/cgroup/<group>/memory.remote.stat
# Remote memory statistics:
# remote_alloc <bytes allocated on remote nodes>
# remote_faults <page faults resolved from remote>
# remote_migrations_in <pages migrated from remote to local>
# remote_migrations_out <pages migrated from local to remote>
# rdma_bytes_rx <total RDMA data received>
# rdma_bytes_tx <total RDMA data sent>
/sys/fs/cgroup/<group>/memory.tier_preference
# Override default tier ordering for this cgroup:
# "local,cxl,remote,compressed,swap" (default)
# "local,compressed,swap" (disable remote memory for this cgroup)
# "local,cxl,remote,swap" (skip compression, prefer remote)
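A write to memory.tier_preference must validate the comma-separated tier list before applying it. A minimal sketch of that parse, using the tier names from the examples above (the error-handling shape is an assumption of this sketch, mirroring a cgroup write failing with EINVAL):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum MemTier { Local, Cxl, Remote, Compressed, Swap }

/// Parse a memory.tier_preference string such as
/// "local,cxl,remote,compressed,swap".
/// Unknown or duplicate tiers are rejected.
pub fn parse_tier_preference(s: &str) -> Result<Vec<MemTier>, String> {
    let mut out = Vec::new();
    for tok in s.split(',') {
        let tier = match tok.trim() {
            "local" => MemTier::Local,
            "cxl" => MemTier::Cxl,
            "remote" => MemTier::Remote,
            "compressed" => MemTier::Compressed,
            "swap" => MemTier::Swap,
            other => return Err(format!("unknown tier: {}", other)),
        };
        if out.contains(&tier) {
            return Err(format!("duplicate tier: {}", tok.trim()));
        }
        out.push(tier);
    }
    Ok(out)
}
```

Omitting a tier (e.g. "local,compressed,swap") disables it for the cgroup, as in the second example above.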
47.7 Distributed Page Cache
47.7.1 Problem
When Node A reads a file that Node B recently accessed and has cached, the standard approach (NFS/CIFS) refetches from the storage server over TCP. This ignores that Node B already has the data cached in its page cache and could serve it faster via RDMA.
47.7.2 Design: Cooperative Page Cache
The page cache gains awareness of what other nodes have cached:
Traditional NFS read:
Node A: read() → VFS → NFS client → TCP → NFS server → disk → TCP → Node A
Latency: ~200 μs (network + server + disk if not cached on server)
Cooperative page cache (shared filesystem):
Node A: read() → VFS → page cache miss → WHERE is this page?
Option 1: Remote page cache (RDMA read from Node B's page cache)
Latency: ~3-5 μs
Option 2: Local disk
Latency: ~10-15 μs (NVMe)
Option 3: Remote disk (NVMe-oF/RDMA)
Latency: ~15-25 μs
Option 4: Traditional NFS/CIFS
Latency: ~200 μs
Kernel picks the fastest source automatically.
47.7.3 Page Cache Directory
// isle-core/src/vfs/cooperative_cache.rs
/// Distributed page cache directory.
/// Tracks which nodes have which file pages cached.
/// Uses a Bloom filter per file for space efficiency.
pub struct CooperativeCache {
/// Per-file cache presence hints.
/// Key: (filesystem_id, inode_number)
/// Value: per-node Bloom filter of cached page offsets.
/// Uses a SlabHashMap (open-addressing hash map backed by the slab allocator)
/// rather than BTreeMap: O(1) average lookup vs. O(log n) with poor cache
/// locality. The number of tracked files is unbounded and dynamic, so a
/// fixed-size array is not applicable; a hash map is the correct structure.
/// The per-file hint array is fixed at MAX_CLUSTER_NODES (64): one optional
/// entry per node. This avoids heap allocation per file and provides O(1)
/// lookup by NodeId (direct indexing). Entries for absent nodes are `None`.
/// This is not on the page fault hot path — it is consulted on page cache miss,
/// which already involves I/O — but O(1) lookup avoids unnecessary overhead.
hints: SlabHashMap<FileId, [Option<NodeCacheHint>; MAX_CLUSTER_NODES]>,
}
pub struct NodeCacheHint {
node_id: NodeId,
/// Bloom filter of page offsets this node has cached.
/// False positives are OK (just a wasted RDMA read, fallback to disk).
/// False negatives are not OK (would miss a cache hit).
/// 4KB (32,768-bit) Bloom filter per file per node: ~0.1% FP rate up to
/// ~2,300 cached pages (~14 bits/page, k=10); at 10K pages FP rises to ~20%.
bloom: BloomFilter,
/// Last update timestamp (hints expire after 30 seconds).
updated_ns: u64,
}
impl CooperativeCache {
/// Find the best source for a cache page.
pub fn find_page(
&self,
file: FileId,
page_offset: u64,
local_node: NodeId,
distances: &ClusterDistanceMatrix,
) -> PageSource {
// 1. Check local page cache first (always).
// 2. Check Bloom filters: which remote nodes might have this page?
// 3. Of those, pick the closest (lowest distance in matrix).
// 4. If no remote cache hit: fall back to storage.
}
}
pub enum PageSource {
/// Page is in local page cache. No I/O needed.
LocalCache,
/// Page is likely cached on a remote node. Try RDMA read.
RemoteCache { node_id: NodeId, expected_latency_ns: u32 },
/// Page is not cached anywhere. Read from storage.
Storage { device: DeviceNodeId },
/// Page is on remote storage (NVMe-oF/RDMA).
RemoteStorage { node_id: NodeId, device: DeviceNodeId },
}
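The four steps sketched in `find_page`'s comments can be made concrete. This is a self-contained simplification: a slice of Bloom-positive nodes and a closure stand in for the per-node filters and the `ClusterDistanceMatrix`:

```rust
pub type NodeId = u32;

/// Simplified PageSource for this sketch (storage variants collapsed).
#[derive(Debug, PartialEq)]
pub enum PageSource {
    LocalCache,
    RemoteCache { node_id: NodeId },
    Storage,
}

/// Steps 1-4 of find_page: local page cache first, then the closest
/// Bloom-positive remote node, then fall back to storage.
pub fn pick_source(
    in_local_cache: bool,
    bloom_hits: &[NodeId],            // remote nodes whose filter reports the page
    distance: &dyn Fn(NodeId) -> u32, // lower is closer
) -> PageSource {
    if in_local_cache {
        return PageSource::LocalCache;
    }
    bloom_hits
        .iter()
        .copied()
        .min_by_key(|&n| distance(n))
        .map(|n| PageSource::RemoteCache { node_id: n })
        .unwrap_or(PageSource::Storage)
}
```

A Bloom false positive simply makes the RDMA read miss, after which the caller retries against the next-best source, ending at storage.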
47.7.4 Cache Coherence for Shared Files
When multiple nodes cache the same file, writes must be coordinated:
Write strategy (per-file, configurable):
1. Write-invalidate (default for shared mutable files):
- Writer acquires exclusive ownership (like DSM write fault)
- All reader copies are invalidated via RDMA
- Writer modifies page, becomes sole cached copy
- Other nodes re-fault on next access (get updated page)
2. Write-through (for append-only logs, databases):
- Writer writes to local page cache AND pushes update to owner
- Owner propagates to all readers via RDMA write
- Higher bandwidth cost, but readers see updates faster
3. No coherence (for read-only data, e.g., shared model weights):
- File is marked immutable (or read-only mounted)
- All nodes cache freely, no invalidation needed
- Best case for ML inference: model weights cached everywhere
Integration with the Distributed Lock Manager (Section 31a):
For clustered filesystems (Section 31), page cache coherence is coordinated through DLM lock operations rather than the DSM coherence protocol. The DLM provides filesystem-aware semantics that the generic DSM protocol cannot:
- Lock downgrade triggers targeted writeback: When a DLM lock is downgraded from EX to PR (Section 31a.8), only dirty pages tracked by the lock's `LockDirtyTracker` are flushed — not the entire inode's page cache. This eliminates the Linux problem where dropping a lock on a large file requires flushing all dirty pages regardless of how many were actually modified.
- MOESI-like page states: Each cached page on a clustered filesystem carries a coherence state relative to the DLM lock protecting it:
  - Modified: Page dirty under EX lock. Sole copy in the cluster.
  - Owned: Page was modified, then lock downgraded to PR. This node is responsible for writing back on eviction. Other nodes may have Shared copies.
  - Exclusive: Page clean, held under EX lock. Can transition to Modified without network traffic.
  - Shared: Page clean, held under PR/CR lock. Read-only.
  - Invalid: Lock released or revoked. Page must be re-fetched on next access.
- Per-lock-range dirty tracking: The cooperative page cache directory (Section 47.7.3) integrates with Section 31a.8's `LockDirtyTracker` to record which pages were dirtied under which lock range. On lock downgrade, the writeback is scoped to the lock's byte range — concurrent holders of non-overlapping ranges on the same file are not affected.
47.7.5 AI Training Data Pipeline
Training data scenario:
- 100TB dataset on shared NVMe storage
- 8 training nodes, each with 4 GPUs
- Each node reads different shards, but shards overlap (data augmentation)
Without cooperative cache:
Each node reads its shard from storage independently.
If Node A and Node B need the same page: two storage reads.
With cooperative cache:
Node A reads page from storage → cached in Node A's page cache.
Node B needs same page → Bloom filter shows Node A has it.
Node B fetches via RDMA Read from Node A: ~3 μs (vs ~15 μs from NVMe).
Storage bandwidth saved: proportional to shard overlap.
For a typical ImageNet-style dataset with 30% shard overlap:
~30% reduction in storage I/O, ~30% more effective storage bandwidth.
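The per-file hint filters used here follow the standard Bloom false-positive estimate p = (1 - e^(-kn/m))^k for m bits, n inserted pages, and k hash functions. For the 4 KB (32,768-bit) filters, roughly 2,300 cached pages with k = 10 keep false positives near 0.1%, while 10K pages push the rate past 20%; a false positive costs only one wasted RDMA read, never correctness. A quick check:

```rust
/// Standard Bloom-filter false-positive estimate:
/// p = (1 - e^(-k*n/m))^k  for m bits, n inserted items, k hash functions.
pub fn bloom_fp_rate(m_bits: f64, n_items: f64, k_hashes: f64) -> f64 {
    (1.0 - (-k_hashes * n_items / m_bits).exp()).powf(k_hashes)
}
```

Because false negatives are forbidden but false positives are merely wasteful, a degraded FP rate on very large cached files is acceptable; it only dilutes the bandwidth savings.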
47.8 Cluster-Aware Scheduler
47.8.1 Problem
The scheduler (Section 14) currently optimizes process placement within a single machine: NUMA-aware load balancing, work stealing between CPUs, migration cost modeling. For a distributed kernel, the scheduler should consider the entire cluster.
47.8.2 Design: Two-Level Scheduler
Level 1: Global Cluster Scheduler (runs every ~10s, lightweight)
- Monitors per-node load (CPU, memory, accelerator utilization)
- Decides process-to-node placement
- Triggers process migration when data locality warrants it
- Respects node affinity, cgroup constraints, capability requirements
Level 2: Per-Node Scheduler (existing, runs every ~4ms)
- CFS/EEVDF + RT + DL queues (unchanged)
- NUMA-aware CPU placement (unchanged)
- Accelerator scheduling (Section 42.2.4, unchanged)
47.8.3 Global Scheduler State
// isle-core/src/sched/cluster.rs
pub struct ClusterScheduler {
/// Per-node load summary (updated via periodic RDMA exchange).
/// Fixed-capacity array indexed by NodeId; only entries 0..node_count are valid.
/// Uses MAX_CLUSTER_NODES (Section 47.2.4) to avoid heap allocation.
node_loads: [NodeLoad; MAX_CLUSTER_NODES],
/// Process-to-data affinity map.
/// Tracks which node holds most of a process's working set.
/// Uses a slab allocator keyed by ProcessId for O(1) average lookup;
/// BTreeMap's O(log n) tree traversal with poor cache locality is
/// unsuitable for per-scheduling-tick access under load.
data_affinity: SlabMap<ProcessId, DataAffinity>,
/// Cluster distance matrix (Section 47.2.4).
distances: ClusterDistanceMatrix,
/// Global load balance interval.
balance_interval_ms: u32, // Default: 10000ms (10 seconds)
/// Migration threshold: only migrate if improvement exceeds this.
migration_threshold_ppt: u32, // Default: 300 (300/1000 = 30% locality improvement)
}
pub struct NodeLoad {
node_id: NodeId,
/// CPU utilization 0-100% (average across all CPUs).
cpu_percent: u32,
/// Memory pressure 0-100% (0 = plenty free, 100 = thrashing).
memory_pressure: u32,
/// Accelerator utilization 0-100% (average across all accelerators).
accel_percent: u32,
/// Number of runnable processes.
runnable_count: u32,
/// Remote memory faults per second (high = poor locality).
remote_fault_rate: u32,
}
pub struct DataAffinity {
/// How many pages of this process's working set are on each node.
/// Used to decide where a process should run.
// Fixed array indexed by NodeId; O(1) access, no allocation. MAX_CLUSTER_NODES=64.
pages_per_node: [u64; MAX_CLUSTER_NODES],
/// Total working set size (pages).
total_working_set: u64,
/// Node with most pages (preferred placement).
preferred_node: NodeId,
/// Working set locality on preferred node (0-1000, parts per thousand).
locality_score_ppt: u32,
}
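Recomputing `preferred_node` and `locality_score_ppt` from `pages_per_node` is a single pass over the fixed array. A sketch, assuming `MAX_CLUSTER_NODES = 64` as above (tie-breaking to the lowest node id is a choice of this sketch):

```rust
pub const MAX_CLUSTER_NODES: usize = 64;

/// Derive the preferred placement and its parts-per-thousand locality
/// score from the per-node working-set page counts.
/// Ties resolve to the lowest node id.
pub fn preferred_placement(pages_per_node: &[u64; MAX_CLUSTER_NODES]) -> (usize, u32) {
    let total: u64 = pages_per_node.iter().sum();
    let mut best_node = 0usize;
    let mut best_pages = pages_per_node[0];
    for (node, &pages) in pages_per_node.iter().enumerate().skip(1) {
        if pages > best_pages {
            best_node = node;
            best_pages = pages;
        }
    }
    let ppt = if total == 0 { 0 } else { (best_pages * 1000 / total) as u32 };
    (best_node, ppt)
}
```

A score of 700 ppt on the preferred node corresponds to the ">70% of pages" migration trigger in Section 47.8.6.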
47.8.4 Process Migration
When the cluster scheduler decides a process should move to another node:
Process migration from Node A to Node B:
1. Cluster scheduler on Node A decides: process P should run on Node B.
Reason: 70% of P's working set is on Node B (remote page faults dominate).
2. Pre-migration:
a. Send process metadata to Node B: PID, capabilities, cgroup, open files,
signal handlers, register state.
b. Node B allocates process slot, creates local task struct.
3. Freeze and transfer:
a. Freeze process P on Node A (stop scheduling, save register state).
b. Transfer register state to Node B via RDMA (~64 bytes, ~1 μs).
c. Transfer kernel stack to Node B via RDMA (~16KB, ~2 μs).
d. Mark P's page table on Node A as "migrated to Node B."
Pages are NOT bulk-transferred — they fault in on demand.
4. Resume on Node B:
a. Node B installs process in local scheduler.
b. Process resumes execution on Node B.
c. First memory access → page fault → fetch from Node A via RDMA (~3-5 μs).
d. Subsequent accesses: pages migrate on demand.
e. Working set migrates over ~100ms as pages are faulted in.
Total migration downtime: ~10-50 μs (metadata transfer + freeze/thaw)
Working set follows lazily: pages migrate over seconds as accessed.
This is the same strategy as live VM migration (pre-copy / post-copy),
but at process granularity. Much lighter weight than VM migration.
Full process migration requires transferring the following state:
- Register state: CPU registers, FPU/SIMD state (saved during freeze).
- Kernel stack: The process's kernel-mode stack (~16KB).
- Page table metadata: Transferred lazily — pages fault in on demand from the source node via RDMA.
- Open file descriptors: For local files on the source node, a proxy is created that forwards read/write/ioctl over RDMA IPC to the source. For shared-filesystem files (NFS, CIFS), the descriptor is re-opened locally on the destination.
- IPC handles: SharedMemory transport handles are converted to Rdma transport handles (Section 47.4.2). The ring buffer contents are preserved.
- Cgroup membership: Recreated on the destination node. The destination cgroup must have sufficient quota.
- Signal state: Pending signals, signal mask, and signal handlers are transferred.
- Timer state: Active timers (POSIX, itimer) are recreated on the destination with remaining duration adjusted for clock synchronization (Section 47.11.5).
- Device handles: Local accelerator contexts are converted to `RemoteDeviceProxy` handles (Section 47.10.2). The process continues to use the original device via RDMA-proxied commands.
Not all state is migratable in v1. See Open Questions (Section 47.14.6) for deferred items including io_uring migration, TCP connection migration, namespace handling, and ptrace state.
Process migration scope (v1):
- In scope: CPU register state, page tables (lazy), local file proxying, IPC handle
conversion, cgroup membership, signal state, timer state, accelerator handle proxying.
- Out of scope for v1: Active TCP/UDP connections (must be re-established), active
RDMA queue pairs (application must handle), hardware-specific state (GPU VRAM contents,
NIC offload state), processes with CLONE_VM threads (thread group migration deferred).
- Limitation: A process holding hardware resources that cannot be proxied (e.g.,
direct GPU rendering context with bound VRAM) will fail migration with -EOPNOTSUPP.
Such processes should set CLUSTER_PIN_NODE to prevent migration attempts.
47.8.5 Capability-Gated Migration
Process migration requires capabilities:
pub const CLUSTER_MIGRATE: u32 = 0x0200; // Allow process migration to remote nodes
pub const CLUSTER_PIN_NODE: u32 = 0x0201; // Pin process to specific node (prevent migration)
pub const CLUSTER_ADMIN: u32 = 0x0202; // Cluster-wide scheduler administration
Processes without CLUSTER_MIGRATE are never migrated. Processes with CLUSTER_PIN_NODE
can pin themselves to their current node. Containers/cgroups can restrict which nodes
their processes can run on:
/sys/fs/cgroup/<group>/cluster.nodes
# Allowed nodes for this cgroup: "0 1 2" or "all"
# Default: current node only (no migration)
/sys/fs/cgroup/<group>/cluster.migrate
# "auto" (kernel decides), "never" (pinned), "prefer" (hint)
# Default: "never" (existing Linux behavior)
47.8.6 Reconciliation: Local vs Distributed Scheduling
The single-node scheduler (Section 14) optimizes for cache locality — keeping tasks on the same CPU core. The distributed scheduler may migrate tasks across nodes, destroying all cache state.
Design principle: Cross-node migration is a last resort, not a default action.
Two-level hierarchy with strict separation:
- Intra-node (Section 14): CFS/EEVDF handles all CPU-local decisions (~4ms tick). Entirely unaware of the cluster.
- Inter-node: `ClusterScheduler` runs every 10 seconds — deliberately slow because cross-node migration is 1000x more expensive than cross-CPU migration.
Migration threshold: A task migrates cross-node only when at least one of these holds:
- Source node CPU utilization exceeds 120% of cluster average AND target is below 80% (sustained for 2+ rebalance intervals), OR
- The task's working set is predominantly on the target node (>70% of pages, per `DataAffinity`), OR
- The task's affinity mask explicitly requests a different node.
Migration cost model:
migration_benefit = (source_load - target_load) * task_weight
migration_cost = cache_refill_time + network_transfer_time + tlb_flush_time
Migration proceeds only when migration_benefit > migration_cost * 1.5 (50% hysteresis
to prevent oscillation).
Warm-up penalty: After cross-node migration, the task's effective load weight is inflated 2x for 20 seconds, preventing immediate re-migration before cache state builds.
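The threshold and cost model above reduce to a small predicate. Units are abstract cost units; expressing the 1.5x hysteresis as x3/2 in integer math is a choice of this sketch (kernel code avoids floating point):

```rust
/// Hysteresis: benefit must exceed cost by 50% (i.e. benefit > cost * 3/2).
const HYSTERESIS_NUM: u64 = 3;
const HYSTERESIS_DEN: u64 = 2;

/// Cross-node migration decision per the cost model above.
/// All inputs share one abstract cost unit.
pub fn should_migrate(
    source_load: u64,
    target_load: u64,
    task_weight: u64,
    cache_refill: u64,
    net_transfer: u64,
    tlb_flush: u64,
) -> bool {
    if source_load <= target_load {
        return false; // no benefit in moving toward an equally or more loaded node
    }
    let benefit = (source_load - target_load) * task_weight;
    let cost = cache_refill + net_transfer + tlb_flush;
    // benefit > cost * 1.5, without floating point
    benefit * HYSTERESIS_DEN > cost * HYSTERESIS_NUM
}
```

Together with the 2x warm-up inflation, the hysteresis is what prevents a task from ping-ponging between two nodes whose loads oscillate around each other.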
47.9 Network-Portable Capabilities
47.9.1 Problem
ISLE's capability system (Section 11) uses opaque kernel-memory tokens validated locally. For distributed operation, capabilities must work across nodes: a process migrated from Node A to Node B should retain its capabilities. A remote RDMA operation should be authorized by a capability that the remote node can verify.
47.9.2 Design: Cryptographically-Signed Capabilities
// isle-core/src/cap/distributed.rs
/// Network-portable capability — split into a compact header (fits on the
/// kernel stack, ~64 bytes) and a separately-allocated signature payload.
///
/// Rationale: PQC signatures (ML-DSA-65: 3,309 bytes, hybrid: 3,373 bytes)
/// make the full capability ~3.6 KB, which must NOT be placed on the kernel
/// stack (kernel stacks are 8-16 KB). The split design keeps the hot path
/// (permission checks, expiry checks) on the stack via CapabilityHeader,
/// while the signature data lives in a slab-allocated CapabilitySignature.
#[repr(C)]
pub struct CapabilityHeader {
/// The local capability this was derived from.
pub object_id: ObjectId,
pub permissions: PermissionBits,
pub generation: u64,
pub constraints: CapConstraints,
// === Network portability extensions ===
/// Node that issued this capability.
pub issuer_node: NodeId,
/// Timestamp of issuance (cluster-relative wall clock,
/// synchronized via PTP/NTP, see Section 47.11.5).
pub issued_at_ns: u64,
/// Expiry timestamp (cluster-relative wall clock, see Section 47.11.5).
/// Capabilities MUST have bounded lifetime
/// (prevents stale capabilities after node failure).
/// Default: 5 minutes. Renewable while issuer is alive.
/// Expiry checking includes a 1ms grace period for clock skew.
pub expires_at_ns: u64,
/// Signature algorithm identifier.
/// Uses SignatureAlgorithm encoding (Section 25.2). u16 accommodates
/// hybrid algorithm IDs (0x0200+).
pub sig_algorithm: u16,
/// Pointer to the slab-allocated signature data.
/// The signature is allocated from a dedicated `cap_sig_slab` pool
/// (fixed-size 3,588-byte slots) to avoid general heap allocation
/// on the capability verification path.
pub signature: *const CapabilitySignature,
}
/// Signature payload for a distributed capability.
/// Allocated from a dedicated slab allocator (`cap_sig_slab`), NOT from
/// the general heap, to ensure bounded allocation latency on the
/// capability verification path.
///
/// Signature formats supported for distributed capabilities:
/// - Ed25519: 64 bytes (current default)
/// - ML-DSA-65: 3,309 bytes (PQC migration target, Section 25)
/// - Hybrid Ed25519 + ML-DSA-65: 3,373 bytes (transition mode)
///
/// SLH-DSA-128f is deliberately EXCLUDED from distributed capabilities.
/// Its 17,088-byte signatures would add ~17 KB to every cross-node
/// capability transfer, making it impractical for runtime operations
/// that occur at lock-acquisition frequency. SLH-DSA-128f is supported
/// for boot signatures (Section 22, KernelSignature/DriverSignature
/// structs with 17,408-byte buffers) where it is verified once at load
/// time. For distributed capabilities, ML-DSA-65 provides NIST PQC
/// security at 1/5th the signature size.
///
/// MAX_DISTRIBUTED_SIG_BYTES = 3,584 (3.5 KiB, accommodates hybrid
/// signatures with alignment headroom).
#[repr(C)]
pub struct CapabilitySignature {
/// Actual signature length in bytes.
pub sig_len: u16,
pub _pad: [u8; 2],
/// Signature data. Only sig_len bytes are meaningful.
pub data: [u8; 3584],
}
/// Convenience type combining header + signature for full capability operations.
/// Never placed on the stack as a whole — the header is on the stack and
/// the signature is accessed via pointer.
pub type DistributedCapability = (CapabilityHeader, *const CapabilitySignature);
Memory layout note: The `CapabilityHeader` is ~64 bytes and safe to place on the kernel stack. The `CapabilitySignature` is 3,588 bytes and allocated from a dedicated slab pool (cap_sig_slab, 3,588-byte slots) — never on the stack. For RDMA transmission, the capability is serialized using a compact wire format that includes only `sig_len` bytes of signature data, reducing typical message size to ~200-400 bytes (Ed25519: 64-byte signature + ~100 bytes of header fields). The `sig_len` field determines how many bytes of the signature are included in the wire format.
Slab Allocator and DoS Mitigation: The cap_sig_slab pool uses per-process quotas
to prevent a malicious process from exhausting kernel memory by allocating many signatures:
/// Per-process quota for capability signature allocations.
/// Tracked in the process's capability space (Section 11).
pub struct CapSignatureQuota {
/// Maximum signature slots this process may hold simultaneously.
/// Default: 1024 (approximately 3.6 MB worst case).
/// Configurable via prctl(PR_SET_CAP_QUOTA).
pub max_slots: u32,
/// Currently allocated slots for this process.
/// Incremented on successful slab allocation, decremented on free.
pub used_slots: AtomicU32,
}
/// Default quota: 1024 signatures (~3.6 MB per process).
pub const DEFAULT_CAP_SIGNATURE_QUOTA: u32 = 1024;
/// System-wide limit on total signature slab memory.
/// Default: 1 GB (approximately 290,000 signatures).
pub const CAP_SIG_SLAB_TOTAL_LIMIT: usize = 1024 * 1024 * 1024;
Allocation protocol:
1. When a process derives a `DistributedCapability`, the kernel attempts to allocate a signature slot from `cap_sig_slab`.
2. Before allocation, check `used_slots.load(Acquire) < max_slots`. If exceeded, fail with `EAGAIN` (transient) — the process must free existing capabilities first.
3. After successful allocation, increment `used_slots` with `Release` ordering.
4. On capability drop (explicit revoke, expiry, or process exit), decrement `used_slots` and return the slot to the slab.
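As written, the check-then-increment sequence in the allocation protocol is racy: two threads of the same process can both pass the quota check before either increments. A compare-and-swap loop folds the check and the increment into one atomic step. A userspace sketch, with std atomics standing in for the kernel's:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

pub struct CapSignatureQuota {
    pub max_slots: u32,
    pub used_slots: AtomicU32,
}

impl CapSignatureQuota {
    /// Atomically reserve one signature slot, or fail (the EAGAIN case)
    /// if the quota is exhausted. fetch_update retries the CAS on
    /// contention, so the check and the increment cannot be interleaved
    /// by another thread.
    pub fn try_reserve(&self) -> Result<(), ()> {
        self.used_slots
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |used| {
                if used < self.max_slots { Some(used + 1) } else { None }
            })
            .map(|_| ())
            .map_err(|_| ())
    }

    /// Return a slot on capability drop (revoke, expiry, or process exit).
    pub fn release(&self) {
        self.used_slots.fetch_sub(1, Ordering::Release);
    }
}
```

The same pattern extends to the system-wide CAP_SIG_SLAB_TOTAL_LIMIT check, which would otherwise have the identical race across processes.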
Eager reclamation: When a process terminates (normal exit or killed), all its
CapabilitySignature slots are freed immediately. The capability space tracks
all signatures via an intrusive linked list per process — O(1) enumeration for cleanup.
Memory pressure handling: If the system-wide cap_sig_slab exceeds 80% of
CAP_SIG_SLAB_TOTAL_LIMIT, the kernel triggers a reclamation pass:
- Scan all processes for expired capabilities (where expires_at_ns < now).
- Free signatures for expired capabilities regardless of process quota.
- This ensures that expired capabilities don't accumulate under memory pressure.
Rationale for 1024-slot default: A typical process holds <100 distributed capabilities (file handles, memory regions, accelerator contexts). 1024 provides 10x headroom while limiting worst-case memory consumption to ~3.6 MB per process. For specialized workloads (distributed databases, HPC orchestrators), the admin can raise the limit via prctl.
47.9.3 Verification
Remote capability verification (any node):
1. Process on Node B presents a CapabilityHeader + CapabilitySignature to access
a resource. The header is on the stack; the signature is dereferenced from the slab.
2. Node B checks:
a. Signature valid? (verify with issuer_node's public key using the
algorithm specified in sig_algorithm — Ed25519 ~25 μs typical, ML-DSA-65 ~110 μs)
→ done once, then cached for lifetime of capability (keyed by object_id + generation)
b. Not expired? (compare expires_at_ns with current time)
→ ~10 ns (clock comparison)
c. Generation still valid? (check local revocation list)
→ ~100 ns (hash table lookup)
d. Permissions sufficient for requested operation?
→ ~10 ns (bitfield comparison)
3. If all checks pass: operation is authorized.
Total first-time verification: ~25 μs (Ed25519) or ~110 μs (ML-DSA-65).
Subsequent verifications (signature cached): ~200 ns.
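The cached fast path (steps 2b and 2d) is two comparisons. A sketch with simplified field types, using the 1 ms clock-skew grace period from Section 47.9.2 (the revocation check of step 2c is omitted here):

```rust
pub type PermissionBits = u32;

/// The stack-resident fields needed for the cached-signature fast path.
pub struct CapCheckFields {
    pub permissions: PermissionBits,
    pub expires_at_ns: u64,
}

/// Clock-skew grace period (Section 47.9.2): 1 ms.
pub const EXPIRY_GRACE_NS: u64 = 1_000_000;

/// Steps 2b and 2d: the ~tens-of-ns checks that run once the signature
/// has been verified and cached.
pub fn check_cached(cap: &CapCheckFields, now_ns: u64, required: PermissionBits) -> bool {
    let not_expired = now_ns <= cap.expires_at_ns.saturating_add(EXPIRY_GRACE_NS);
    let has_perms = cap.permissions & required == required;
    not_expired && has_perms
}
```

The permission test requires every requested bit to be present, so a capability with a superset of the requested rights still authorizes the operation.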
47.9.4 Revocation
Capability revocation in a distributed system is harder than on a single node because capabilities may be cached on remote nodes that are temporarily unreachable.
Single-node revocation (Section 11.1): Generation-based. O(1), no table scanning.
Distributed revocation protocol:
- Initiation: Home node increments the object's generation in the home directory.
- Broadcast: Home node sends `CapRevoke { object_id, new_generation }` to all nodes holding capabilities for this object. The grant log tracks which nodes are affected — only those nodes receive the broadcast.
- Acknowledgment: Each remote node marks matching capabilities invalid, responds with `CapRevokeAck`.
- Stale rejection: If a remote node presents an old-generation capability to the home node before receiving the broadcast, the home node rejects it immediately.
- Partition handling: If a node is unreachable, revocation messages are queued in a durable per-node outbox. When the partition heals, queued revocations replay in order. During the partition, stale capabilities work for local cached reads but fail for any operation requiring home-node validation.
Consistency guarantee: Revocation is eventually consistent. After broadcast completes and all nodes acknowledge, the capability is universally invalid. During the broadcast window (typically <1ms on RDMA), stale capabilities may succeed for local cached reads but fail for home-node operations.
RDMA optimization: Broadcasts use RDMA SEND over reliable connected (RC) queue pairs for guaranteed delivery and ordering.
Interaction with expiry: Distributed capabilities have bounded lifetimes (default: 5 minutes). Revocation is the fast path; expiry is the safety net — even if revocation messages are permanently lost, no stale capability survives beyond its expiry window.
Scalability: The broadcast is O(N) where N = nodes holding capabilities for the specific object (typically 2-5), not the cluster size.
47.9.5 Use Case: Remote GPU Access
Process on Node A wants to submit work to GPU on Node B:
1. Process has local capability: ACCEL_COMPUTE for GPU on Node B.
2. Kernel derives DistributedCapability, signs with Node A's key.
3. Kernel sends command submission + capability to Node B via RDMA.
4. Node B's kernel verifies capability:
- Signature valid (Node A's key)
- Not expired
- Not revoked
- Has ACCEL_COMPUTE permission for this GPU
5. Node B's AccelScheduler accepts the submission.
6. Completion notification sent back to Node A via RDMA.
The process on Node A uses the same AccelContext API
as for a local GPU. The kernel handles the distribution.
47.10 Distributed Device Fabric
47.10.1 Remote Device Access
The KABI vtable model (Section 7) naturally extends to remote devices. A device on Node B can be accessed from Node A through a proxy driver:
Local device access (existing):
Process → syscall → ISLE Core → KABI vtable call → Driver → Device
Remote device access (new):
Process → syscall → ISLE Core → KABI vtable call → RDMA Proxy Driver
→ RDMA transport → Node B ISLE Core → KABI vtable call → Driver → Device
The proxy driver implements the same KABI vtable as the real driver.
It translates vtable calls into RDMA messages. ISLE Core cannot tell
the difference between a local driver and a proxy driver.
47.10.2 Proxy Driver
// isle-core/src/distributed/proxy.rs
/// A proxy driver that forwards KABI vtable calls to a remote node.
/// Appears as a normal driver to the local kernel.
pub struct RemoteDeviceProxy {
/// Remote node hosting the actual device.
remote_node: NodeId,
/// Remote device ID (on the remote node's device registry).
remote_device_id: DeviceNodeId,
/// RDMA connection to remote node.
transport: KernelTransport,
/// Cached device info (refreshed periodically).
cached_info: AccelDeviceInfo,
/// Capability header for accessing the remote device.
/// The associated CapabilitySignature is slab-allocated and referenced
/// via the header's signature pointer.
remote_cap: CapabilityHeader,
/// Outstanding remote calls (for timeout/cancellation).
/// Uses a slab allocator (not a BTreeMap) for O(1) insertion and lookup by
/// call ID: the call ID encodes the slab slot index directly, so no tree
/// traversal is needed. The slab is bounded by the per-proxy in-flight limit
/// (max_inflight from NodeConnection), which is fixed at connection setup time.
pending_calls: SlabMap<PendingRemoteCall>,
}
This enables:
| Scenario | How It Works |
|---|---|
| Node A uses GPU on Node B | AccelBase proxy over RDMA. Command buffers sent via RDMA Write. |
| Node A reads NVMe on Node B | Block I/O proxy over RDMA (equivalent to NVMe-oF, but kernel-native). |
| Node A uses FPGA on Node B | AccelBase proxy. Same vtable, different device class. |
| Kubernetes GPU sharing | Kubelet on Node A schedules pod needing GPU. GPU is on Node B. Kernel-transparent. |
47.10.3 GPUDirect RDMA Across Nodes
For GPU-to-GPU communication across nodes (essential for distributed training):
Current state (NCCL on Linux):
GPU 0 (Node A) → PCIe → CPU RAM (Node A) → RDMA NIC → Network →
→ RDMA NIC → CPU RAM (Node B) → PCIe → GPU 0 (Node B)
Copies: 2 (GPU→CPU, CPU→GPU). CPU involvement: yes.
With GPUDirect RDMA (supported by Mellanox NICs + NVIDIA GPUs):
GPU 0 (Node A) → PCIe → RDMA NIC → Network →
→ RDMA NIC → PCIe → GPU 0 (Node B)
Copies: 0. CPU involvement: none.
ISLE integration:
- P2P DMA (Section 43) handles local GPU↔NIC path
- KernelTransport handles the RDMA portion
- RdmaDeviceVTable.register_device_mr() (Section 46.1.3) registers GPU VRAM for RDMA
- Combined path: GPU→NIC→Network→NIC→GPU, zero CPU copies
The kernel manages the IOMMU mappings on both ends, ensuring that:
- GPU VRAM is registered as an RDMA memory region
- RDMA NIC has IOMMU permission to DMA to/from GPU BAR
- Remote node's RDMA NIC has permission via remote rkey
- Capability system authorizes the cross-node GPU-to-GPU transfer
47.11 Failure Handling and Split-Brain
47.11.1 Failure Model
Distributed systems have failure modes that single-machine kernels don't:
| Failure | Detection | Recovery |
|---|---|---|
| Node crash (power loss) | Heartbeat timeout (Suspect at 300ms, Dead at 1000ms per Section 47.11.2) | Reclaim resources, invalidate capabilities |
| Network partition | Heartbeat timeout + asymmetric | Split-brain protocol (Section 47.11.3) |
| RDMA NIC failure | Link-down event + failed RDMA ops | Fallback to TCP or isolate node |
| Slow node (Byzantine) | Heartbeat latency spike | Mark suspect, reduce trust |
| Storage failure | I/O error from block driver | FMA-managed (Section 39) |
47.11.2 Heartbeat Protocol
// isle-core/src/distributed/heartbeat.rs
pub struct HeartbeatConfig {
/// Heartbeat interval.
/// Default: 100ms (10 heartbeats/sec).
pub interval_ms: u32,
/// Miss count before marking node Suspect.
/// Default: 3 (300ms of silence → Suspect).
pub suspect_threshold: u32,
/// Miss count before marking node Dead.
/// Default: 10 (1000ms of silence → Dead).
pub dead_threshold: u32,
/// Heartbeat uses RDMA Send (two-sided) not RDMA Write (one-sided)
/// because we need the remote CPU to respond (proof of liveness).
pub transport: HeartbeatTransport,
}
// The heartbeat sender and receiver threads run at SCHED_FIFO priority
// (configurable, default priority 50) to avoid false suspect transitions
// caused by CPU saturation delaying heartbeat processing. For non-RDMA
// (TCP) clusters where network latency is higher and more variable, the
// recommended defaults are: interval_ms=500, suspect_threshold=3 (1500ms),
// dead_threshold=10 (5000ms).
/// Heartbeat message (sent via RDMA Send, 64 bytes).
#[repr(C)]
pub struct HeartbeatMessage {
/// Sender node ID.
pub node_id: NodeId, // 4 bytes
pub _pad0: u32, // 4 bytes — explicit alignment padding for generation
/// Monotonic generation (incremented on node restart).
/// If generation changes, it means the node rebooted.
pub generation: u64,
/// Sender's current timestamp (for clock skew estimation).
pub timestamp_ns: u64,
/// Sender's load summary (for cluster scheduler).
pub cpu_percent: u32,
pub memory_pressure: u32,
pub accel_percent: u32,
pub _pad1: u32, // 4 bytes — explicit alignment padding for membership_view
/// Sender's view of cluster membership (u64 bitfield, MAX_CLUSTER_NODES = 64).
pub membership_view: u64,
/// Explicit padding to bring total struct size to 64 bytes (one cache line).
/// Layout: node_id(4) + _pad0(4) + generation(8) + timestamp_ns(8) +
/// cpu_percent(4) + memory_pressure(4) + accel_percent(4) + _pad1(4) +
/// membership_view(8) + _pad(16) = 64.
pub _pad: [u8; 16],
}
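The padding arithmetic in the layout comment can be verified mechanically with `size_of`. A standalone mirror of the struct, assuming `NodeId` is a 4-byte `u32` as the field comment indicates:

```rust
pub type NodeId = u32; // 4 bytes, per the field comment above

/// Mirror of HeartbeatMessage for layout verification only.
#[repr(C)]
pub struct HeartbeatMessage {
    pub node_id: NodeId,
    pub _pad0: u32,
    pub generation: u64,
    pub timestamp_ns: u64,
    pub cpu_percent: u32,
    pub memory_pressure: u32,
    pub accel_percent: u32,
    pub _pad1: u32,
    pub membership_view: u64,
    pub _pad: [u8; 16],
}
```

With the explicit `_pad0`/`_pad1` fields, every u64 lands on an 8-byte boundary and the struct packs into exactly one 64-byte cache line with no compiler-inserted padding.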
47.11.3 Split-Brain Resolution
Network partition can cause split-brain: two groups of nodes each believe the other group has failed.
Strategy: Majority quorum + lease-based fencing tokens.
Cluster: Nodes {A, B, C, D, E} (5 nodes)
Partition: {A, B, C} can talk to each other. {D, E} can talk to each other.
Neither group can reach the other.
Resolution:
1. Each group counts its members.
2. {A, B, C} has 3/5 = majority. Continues operating.
3. {D, E} has 2/5 = minority. Enters read-only mode.
- No new DSM writes (avoids conflicting updates).
- No process migrations.
- Local workloads continue (processes already on D, E keep running).
- Remote page faults that target {A, B, C} get errors.
4. When partition heals:
- {D, E} rejoin the cluster.
- DSM directory entries are reconciled (version numbers resolve conflicts).
- Process affinity recalculated.
**Lease-Based Fencing Tokens**: To prevent split-brain ambiguity, ISLE uses
monotonically-increasing fencing tokens (lease epochs) for all cluster-wide
operations:
```rust
/// Fencing token for split-brain prevention.
/// Monotonically increments on every quorum leadership change.
/// Used to invalidate stale operations from partitioned minorities.
#[repr(C)]
pub struct FencingToken {
/// Monotonically increasing epoch.
/// Incremented when:
/// 1. Cluster membership changes (node join/leave)
/// 2. Quorum leader changes (leader failure)
/// 3. Admin triggers manual fencing (maintenance)
pub epoch: u64,
/// Node ID of the quorum leader that issued this token.
/// Used for token validation and leader identification.
pub leader_node: NodeId,
/// Timestamp when this token was issued (cluster-relative, PTP-synchronized).
/// Tokens expire after FENCING_TOKEN_TTL_NS (default: 30 seconds).
pub issued_at_ns: u64,
}
/// Token time-to-live: tokens older than this are considered invalid.
/// Must be longer than the worst-case partition detection time (heartbeat timeout).
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds
```
Token propagation protocol:
- **Leader election**: When cluster membership changes, the new quorum leader
  (determined by the deterministic rules below) increments the fencing epoch and
  broadcasts the new `FencingToken` to all nodes in its partition.
- **Token validation**: Every DSM write and capability grant includes the sender's
  current fencing token. The receiver rejects operations with stale tokens
  (`token.epoch < local_token.epoch`) or expired tokens
  (`now - token.issued_at_ns > FENCING_TOKEN_TTL_NS`).
- **Partition healing**: When partitions heal, the surviving leader's fencing token
  takes precedence. Nodes from the minority partition discard their stale tokens and
  adopt the majority's token before resuming normal operations.
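The validation rule can be expressed as a pure function. A minimal sketch under the definitions above; the `TokenError` type is illustrative, not the kernel's actual error enum:

```rust
pub const FENCING_TOKEN_TTL_NS: u64 = 30_000_000_000; // 30 seconds

pub struct FencingToken {
    pub epoch: u64,
    pub leader_node: u32,
    pub issued_at_ns: u64,
}

#[derive(Debug, PartialEq)]
pub enum TokenError {
    StaleEpoch, // token.epoch < local_token.epoch
    Expired,    // now - issued_at_ns > FENCING_TOKEN_TTL_NS
}

/// Accept or reject an incoming operation based on its fencing token.
pub fn validate_token(
    token: &FencingToken,
    local: &FencingToken,
    now_ns: u64,
) -> Result<(), TokenError> {
    if token.epoch < local.epoch {
        return Err(TokenError::StaleEpoch);
    }
    if now_ns.saturating_sub(token.issued_at_ns) > FENCING_TOKEN_TTL_NS {
        return Err(TokenError::Expired);
    }
    Ok(())
}
```

A token from a partitioned minority carries an old epoch and is rejected as `StaleEpoch` even if it has not yet expired; the TTL catches the complementary case of a token issued before a leader was silently fenced off.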
**Deterministic Tie-Breaking Rules**: To ensure all nodes independently reach the same decision during a partition, ISLE applies the following rules in strict precedence order (Rule 1 > Rule 3 > Rule 2 > Rule 2a):
- **Rule 1: Majority Wins** — The partition with >50% of nodes wins.
  - 5-node cluster: 3+ nodes win.
  - 6-node cluster: 4+ nodes win.
- **Rule 3: External Witness (optional)** — An admin-configured external witness (node ID 255, or a dedicated witness VM) acts as a tiebreaker when neither side holds a majority. If configured, the partition containing the witness wins, overriding Rules 2/2a.
- **Rule 2: Larger Partition Wins** — If neither partition holds a strict majority (possible when some nodes are unreachable from both sides), the partition with more nodes wins.
- **Rule 2a: Equal Size → Higher Node ID Sum Wins** — If two partitions have exactly the same number of nodes (possible only in even-sized clusters), the partition with the higher total node ID sum wins. If the sums are also equal, the partition whose lowest node ID is smaller wins.
  - Example: In a 4-node cluster {A=1, B=2, C=3, D=4}, partitions {A, D} (sum=5) and {B, C} (sum=5) tie on sum, so the minimum-ID tiebreaker applies: {A, D} has min=1, {B, C} has min=2, so {A, D} wins.
  - This provides a total order over all possible equal-size partitions without requiring external coordination.
Implementation for 64-node clusters with fixed-size bitmasks:
```rust
/// Deterministically select the winning partition from two candidates.
/// Returns true if partition_a wins over partition_b.
///
/// # Preconditions
/// - Partitions are disjoint (no node in both)
/// - Neither partition is empty
pub fn partition_wins(partition_a: u64, partition_b: u64, cluster_size: u8,
                      witness_mask: u64) -> bool {
    let count_a = partition_a.count_ones();
    let count_b = partition_b.count_ones();
    // Rule 1: Strict majority wins (floor(n/2) + 1 nodes).
    let majority_threshold = cluster_size as u32 / 2 + 1;
    if count_a >= majority_threshold { return true; }
    if count_b >= majority_threshold { return false; }
    // Rule 3: External witness (if configured)
    if witness_mask != 0 {
        let witness_in_a = (partition_a & witness_mask) != 0;
        let witness_in_b = (partition_b & witness_mask) != 0;
        if witness_in_a && !witness_in_b { return true; }
        if witness_in_b && !witness_in_a { return false; }
    }
    // Rule 2: Larger partition wins
    if count_a != count_b { return count_a > count_b; }
    // Rule 2a: Equal size → higher node ID sum wins
    let sum_a = node_id_sum(partition_a);
    let sum_b = node_id_sum(partition_b);
    if sum_a != sum_b { return sum_a > sum_b; }
    // Rule 2a final tiebreaker: lower minimum node ID wins
    // (guaranteed to differ for disjoint partitions of equal size)
    min_node_id(partition_a) < min_node_id(partition_b)
}

/// Compute sum of node IDs in a partition bitmask.
fn node_id_sum(partition: u64) -> u32 {
    let mut sum = 0u32;
    let mut mask = partition;
    while mask != 0 {
        let lowest = mask.trailing_zeros();
        sum += lowest + 1; // node IDs are 1-indexed
        mask &= !(1u64 << lowest);
    }
    sum
}

/// Find the minimum node ID in a partition bitmask.
fn min_node_id(partition: u64) -> u32 {
    partition.trailing_zeros() + 1 // node IDs are 1-indexed
}
```
**Raft/Paxos for Critical Cluster Metadata**: For cluster-critical state that cannot tolerate any inconsistency (security policies, node authentication keys, cluster configuration), ISLE uses Raft consensus rather than the simpler majority-quorum protocol:
- Raft scope: Security policy replication, node certificate revocation, cluster-wide configuration changes (adding/removing nodes).
- Quorum protocol scope: DSM page ownership, capability caching, heartbeat.
- Rationale: Raft provides linearizability but requires persistent logging and leader election overhead. Using it only for critical metadata avoids imposing Raft's cost on the high-frequency DSM operations.
The Raft implementation is a simplified single-term leader election with
log replication, managed by the ClusterMetadataReplicator service. Leader
election uses the same deterministic rules above to avoid split-brain during
Raft leader selection.
In-flight RDMA operations during partition: When a network partition occurs, RDMA
NICs report completion with error status for all in-flight operations. KernelTransport
translates these to TransportError::ConnectionLost. Callers (DSM fault handler, IPC
ring) retry once (in case of transient link flap), then return an error to the process:
SIGBUS for DSM page faults, EIO for IPC operations.
Minority partition DSM behavior: Nodes in the minority partition mark all DSM pages
as SUSPECT. Write accesses to SUSPECT pages are blocked (write-protect in page tables);
write attempts fault and return EAGAIN. Read accesses to SUSPECT pages continue to
return locally-cached data (preserving availability for read-heavy workloads during
brief partitions), but set a per-page stale_read flag. Applications can check whether
they have read potentially stale data via madvise(MADV_DSM_CHECK), which returns
EDSM_STALE if any SUSPECT pages were read since the last check. This ensures that
stale reads are never silent — applications that care about consistency can detect and
handle them, while applications that tolerate staleness continue without interruption.
When the partition heals, SUSPECT pages are reconciled with the majority partition's
directory and the SUSPECT marking is cleared.
Unreachable home node: Each DSM page has two home nodes: primary (determined by
hash(region, VA) % cluster_size) and backup (determined by hash_backup(region, VA)
% cluster_size using a different hash seed, guaranteed to differ from the primary).
Backup home node protocol:
1. Shadow directory maintenance: On every directory state change (ownership
transfer, reader set update), the primary home node sends the updated
DsmDirectoryEntry to the backup via RDMA Write to a pre-allocated shadow
directory region on the backup node. Each entry includes a generation counter
(incremented on every update) for consistency verification.
2. Consistency: The backup's shadow directory is write-only from the primary's
perspective — the backup never modifies it independently. The generation counter
ensures that stale writes (e.g., reordered RDMA Writes) are detected and
discarded. The backup compares the incoming generation counter against its last
seen value; only strictly incrementing updates are applied.
3. Failover: When the primary home node is declared Dead by the cluster
membership protocol (Section 47.11), the new home node is determined
deterministically — not by self-promotion. The membership protocol's
NodeDead event triggers re-evaluation of the same hashing rule used for
initial home placement (Section 47.5.3): hash(region, VA) % new_cluster_size
with the dead node removed from the membership set. All surviving nodes
compute the same result, producing a single deterministic new home. No node
self-promotes. The new home is computed deterministically from the membership
epoch, so all survivors agree.
If the deterministic new home happens to be the node already holding the
backup shadow directory, it promotes the shadow to primary. Otherwise, the
backup transfers its shadow entries to the computed new home. Directory
entries on the new home may be slightly stale (by at most one in-flight
update). The version counter in each DsmDirectoryEntry allows requestors
to detect and retry if they encounter a stale entry.
4. Partition healing: When the primary returns, the primary and backup reconcile
their directories: for each entry, the node with the higher generation counter
wins. The backup reverts to shadow mode after reconciliation completes.
For even-numbered clusters (no strict majority possible):
- Admin designates a "tiebreaker" node (or external witness).
- Or: the smaller-numbered-node-set wins (deterministic, no external dependency).

Caveat: the smaller-numbered-node-set heuristic is simple but has a known weakness — if the lower-numbered nodes are physically co-located, a power event affecting that rack consistently picks the wrong survivor set. For production deployments, an external quorum device (dedicated witness VM, or a third-site arbitrator via RDMA or TCP heartbeat) is recommended. The heuristic remains the default fallback when no external witness is configured.
#### 47.11.4 DSM Recovery After Node Failure
Node B fails (holds exclusive ownership of some DSM pages):
- Heartbeat timeout → Node B marked Dead.
- For each DSM page where B was owner:
  a. The home node (determined by hash) still has the directory entry.
  b. If B was SharedOwner: readers still have valid copies.
     → Promote one reader to owner (pick the one closest to the home node).
  c. If B was Exclusive: the page data is LOST (the only copy was on B).
     → Mark the page as "lost." Processes faulting on it get SIGBUS.
     → The application must handle this (checkpoint/restart).
- For each DSM page where B was a reader:
  a. Simply remove B from the reader set. No data loss.
- Capabilities issued by B expire naturally (bounded lifetime). Remote nodes stop accepting B-signed capabilities immediately.
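The per-page recovery decision is a match over the dead node's role for that page. A minimal sketch; the type names and the first-reader promotion policy shown here are illustrative (the text says "closest to home node", which this sketch does not model):

```rust
#[derive(Debug, PartialEq)]
pub enum OwnerRole { SharedOwner, Exclusive, Reader }

#[derive(Debug, PartialEq)]
pub enum RecoveryAction {
    /// Readers still hold valid copies; promote one to owner.
    PromoteReader { new_owner: u32 },
    /// Only copy died with the node; subsequent faults get SIGBUS.
    MarkLost,
    /// Dead node was just a reader; no data loss.
    DropFromReaderSet,
}

/// Decide the recovery action for one page after node B dies.
/// `readers` is the surviving reader set (non-empty iff B was SharedOwner).
pub fn recover_page(role: OwnerRole, readers: &[u32]) -> RecoveryAction {
    match role {
        OwnerRole::SharedOwner => RecoveryAction::PromoteReader { new_owner: readers[0] },
        OwnerRole::Exclusive => RecoveryAction::MarkLost,
        OwnerRole::Reader => RecoveryAction::DropFromReaderSet,
    }
}
```

Only the Exclusive case loses data, which is exactly what the `DSM_REPLICATE` mitigation below addresses.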
Mitigation for exclusive page loss:
- DSM regions can be created with DSM_REPLICATE flag (replication_factor = 2).
- Every write to an exclusive page is mirrored to a backup node.
- On failure: backup is promoted to owner. No data loss.
- Cost: 2x write bandwidth for replicated regions.
**Design rationale — why replication is NOT the default:**
DSM is a **performance optimization**, not a durability mechanism. The default
(`replication_factor = 1`) is correct for the typical DSM use case:
1. **Ephemeral data**: Caches, temporary buffers, computational scratch space. Loss on
node failure is acceptable — the data can be recomputed or reloaded from source.
2. **Read-mostly workloads**: Configuration data, reference tables, shared code pages.
These are typically backed by persistent storage; losing the in-memory copy just
means reloading from disk.
3. **Application-managed durability**: Databases, distributed file systems, and message
queues implement their own replication and checkpointing. Adding DSM-level
replication would be redundant and wasteful.
4. **Performance sensitivity**: The 2x bandwidth cost and ~15% latency overhead of
synchronous replication would penalize all DSM users, even those who don't need it.
**When to enable replication:**
- DSM regions holding irreplaceable data without application-level persistence
- Workloads where recomputation cost exceeds replication cost
- Environments where node failure is frequent enough to justify the overhead
**Durability is the application's responsibility**: Just as applications using `mmap()`
or `malloc()` must implement their own persistence, DSM users must decide whether
their data warrants replication. The kernel provides the mechanism (`DSM_REPLICATE`);
the policy is left to the application.
#### 47.11.5 Clock Synchronization
Distributed capabilities (Section 47.9) and DSM timeouts rely on nodes having
synchronized clocks. ISLE requires PTP (IEEE 1588 Precision Time Protocol) as
the primary clock synchronization mechanism:
- **PTP grandmaster**: One node (or a dedicated PTP appliance) serves as the time
reference. All other nodes synchronize to it via hardware PTP timestamping on the
RDMA NIC.
- **Expected accuracy**: <1 μs with hardware PTP (typical for modern RDMA NICs with
PTP hardware timestamping support).
- **NTP fallback**: If PTP is not available (no hardware support), NTP is used as a
fallback. Expected accuracy: 1-10 ms. When using NTP, capability expiry grace period
is increased to 100ms (from the default 1ms with PTP).
- **Maximum acceptable skew**: 1ms for PTP deployments. Capability expiry includes a
1ms grace period to account for this skew. Nodes with clock skew exceeding 10ms
trigger an FMA alert (reported as a `HealthEvent` with
`class: HealthEventClass::Network`, event code `CLOCK_SKEW_EXCEEDED` — clock skew is
a network-level health event, not its own event class).
- **Clock skew estimation**: Each heartbeat message (Section 47.11.2) includes the
  sender's timestamp. The receiver estimates one-way clock skew as
  `remote_ts + RTT/2 - local_ts` (zero when the clocks agree; positive when the
  remote clock runs ahead). Persistent skew > 1ms triggers a PTP resynchronization.
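The skew estimate from the last bullet can be sketched directly. Signed arithmetic is needed because the remote clock may be behind; the function name is illustrative:

```rust
/// One-way clock skew estimate from a heartbeat exchange.
/// `remote_ts` is the sender's timestamp_ns, `local_ts` the receive time
/// on the local clock, `rtt_ns` the measured round trip.
/// Returns nanoseconds; positive means the remote clock runs ahead.
pub fn estimate_skew_ns(remote_ts: u64, local_ts: u64, rtt_ns: u64) -> i64 {
    // With perfect clocks, local_ts = remote_ts + rtt/2, so this is zero.
    remote_ts as i64 + (rtt_ns / 2) as i64 - local_ts as i64
}
```

The RTT/2 term assumes a symmetric network path; asymmetric routes bias the estimate, which is one reason hardware PTP (which timestamps at the NIC) is preferred over this heartbeat-based estimate for actual synchronization.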
---
### 47.12 CXL 3.0 Fabric Integration
#### 47.12.1 Why CXL Changes Everything
CXL (Compute Express Link) 3.0 provides **hardware-coherent** shared memory across
a PCIe fabric. Unlike RDMA (which requires software coherence protocols), CXL memory
is accessed via normal CPU load/store instructions with hardware cache coherence.
> **Hardware availability caveat**: CXL 3.0 hardware with full shared memory semantics
> (back-invalidate snooping for multi-host coherence) is not yet commercially available
> as of 2025. ISLE's CXL support targets CXL 2.0 Type 3 devices (available in Sapphire
> Rapids, Genoa) with software-managed coherence. CXL 3.0 back-invalidate snooping will
> be supported when hardware becomes available. The CXL 3.0 sections below describe the
> target architecture; implementation is gated on hardware availability.
Memory access latency spectrum:
Local DRAM (same socket):    ~50-80 ns     (load/store)
Local DRAM (cross-socket):   ~100-150 ns   (load/store, QPI/UPI)
CXL 2.0 attached memory:     ~200-400 ns   (load/store, PCIe + CXL)
CXL 3.0 shared memory pool:  ~200-500 ns   (load/store, coherent)
RDMA:                        ~2000-5000 ns (explicit transfer)
NVMe SSD:                    ~10000 ns     (block I/O)
CXL 3.0 shared memory is 5-25x faster than RDMA because:
1. No software protocol (hardware coherence via CXL.cache, CXL.mem)
2. Cache-line granularity (64 bytes, not 4KB pages)
3. No memory registration overhead
4. CPU load/store instructions, not DMA engine
#### 47.12.2 Design: CXL as a First-Class Memory Tier
`PageLocation` (defined canonically in Section 43.1.5 of 11-accelerators.md,
reproduced in Section 47.5.5 above) is reused here for distributed page placement
decisions. See Section 47.5.5 for the full enum definition including RDMA-specific
variants.
```rust
// Extend NumaNodeType (from Section 43.1.3)
pub enum NumaNodeType {
CpuMemory,
AcceleratorMemory { /* ... */ },
CxlMemory {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
},
// NEW: CXL 3.0 shared memory pool (visible to multiple nodes)
CxlSharedPool {
/// Latency from this CPU to the shared pool.
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
/// Nodes that share this pool (u64 bitfield, MAX_CLUSTER_NODES = 64).
sharing_nodes: u64,
/// Hardware coherence protocol version.
coherence_version: CxlCoherenceVersion,
},
}
#[repr(u32)]
pub enum CxlCoherenceVersion {
/// CXL 2.0: pooled memory, no hardware coherence between hosts.
/// Kernel manages coherence via software protocol (like DSM Section 47.5).
Cxl20Pooled = 0,
/// CXL 3.0: hardware-coherent shared memory.
/// CPU cache coherence protocol extended across CXL fabric.
/// No software coherence needed — hardware handles it.
Cxl30Coherent = 1,
}
```
#### 47.12.3 CXL + RDMA Hybrid
In a realistic datacenter, both CXL and RDMA will coexist:
Rack-level (CXL 3.0 fabric, ~200-500 ns):
┌─────────┐ CXL ┌─────────┐ CXL ┌─────────┐
│ Node 0 │◄────────►│ CXL │◄────────►│ Node 1 │
│ CPU+GPU │ │ Switch │ │ CPU+GPU │
└─────────┘ │ +Memory │ └─────────┘
│ Pool │
└────┬────┘
│ CXL
┌────┴────┐
│ Node 2 │
│ CPU+GPU │
└─────────┘
Cross-rack (RDMA, ~2-5 μs):
Rack 0 ◄──── 400GbE RDMA ────► Rack 1
Memory tier ordering within this topology:
Tier 1: Local DRAM (~80 ns)
Tier 2: CXL shared pool (~300 ns) ← same rack
Tier 3: GPU VRAM (~500 ns)
Tier 4: Compressed (~1-2 μs to decompress)
Tier 5: Remote DRAM via RDMA (~3 μs) ← cross rack
Tier 6: Local NVMe (~12 μs)
Tier 7: Remote NVMe via NVMe-oF/RDMA ← cross rack
The kernel detects CXL and RDMA links automatically (device registry)
and builds the distance matrix accordingly. No manual configuration
of tier ordering — the measured latencies determine placement policy.
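Because tier order is derived from measurement rather than configuration, the placement policy reduces to sorting tiers by observed latency. A minimal sketch, with numbers taken from the tier list above (the function name is illustrative):

```rust
/// Order memory tiers by measured access latency.
/// The distance matrix supplies the latencies at boot; nothing is
/// hard-coded, so new link types (CXL, RDMA) slot in automatically.
pub fn order_tiers(mut tiers: Vec<(&'static str, u64)>) -> Vec<&'static str> {
    tiers.sort_by_key(|&(_, latency_ns)| latency_ns);
    tiers.into_iter().map(|(name, _)| name).collect()
}
```

A rack with a CXL pool measures it at ~300 ns and places it above GPU VRAM and remote DRAM; a rack without CXL simply never sees that entry, and the ordering degrades gracefully to the RDMA-only hierarchy.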
#### 47.12.4 CXL Shared Memory for DSM
When CXL 3.0 hardware-coherent shared memory is available, the DSM protocol (Section 47.5) simplifies dramatically:
DSM over RDMA (software coherence):
- Page fault → directory lookup → ownership transfer → RDMA page transfer
- ~10-25 μs per fault
- Software TLB invalidation protocol
DSM over CXL 3.0 (hardware coherence):
- Map CXL shared pool pages into process address space
- CPU load/store works directly (hardware coherence)
- No page faults for coherence (hardware handles cache-line invalidation)
- ~200-500 ns access latency (same as accessing CXL memory)
- Software DSM protocol not needed for CXL-connected nodes
The kernel uses CXL shared memory when available (intra-rack),
and falls back to RDMA-based DSM for cross-rack communication.
Best transport is selected automatically per page.
DSM redundancy and DLM: For node pairs connected via a CXL 3.0 fabric, the
software DSM page-ownership state machine (§47.5.2) is redundant — the hardware
handles cache-line-granularity coherence without software page faults or ownership
transfers. ISLE's DSM routing layer uses CxlPool transport (load/store) for these
pairs and skips the full coherence protocol.
However, the Distributed Lock Manager (DLM, §47.6) remains fully required even
with CXL 3.0. Hardware cache coherence does not provide mutual exclusion semantics
for arbitrary data structures (spinlocks, reader-writer locks, cross-node transactions).
DLM lock acquisition (LOCK_ACQUIRE, atomic CAS on lock words) continues to operate
as specified — CXL just makes the lock words accessible via load/store rather than
RDMA, which reduces lock round-trip latency but does not eliminate the need for the
protocol.
CXL 3.0 node pair — what changes vs. RDMA:
DSM page faults: eliminated (hardware coherence)
DSM ownership xfer: eliminated (no protocol needed)
DLM lock acquire: unchanged in protocol, load/store instead of RDMA
DLM deadlock detect: unchanged
Heartbeat / membership: unchanged
Capability tokens: unchanged
#### 47.12.5 CXL Devices as ISLE Peers
The three CXL device types map to distinct ISLE peer operating models. The classification determines Mode A vs. Mode B, peer role (compute vs. memory-manager), and crash recovery behavior.
##### 47.12.5.1 Type 1: Coherent Compute (CXL.cache only)
Type 1 devices have their own compute and a cache that participates in the host CPU's
coherency domain via CXL.cache. They have no device-managed DRAM visible via CXL.
ISLE operating model:
- Mode B peer (hardware-coherent transport) — CXL.cache IS the cache coherence
protocol. Ring buffers placed in the shared region are coherent by hardware without
any explicit cache flush or ownership protocol in software.
- Full compute peer — runs ISLE or a firmware shim (Paths A/B). Participates in
cluster membership, heartbeat, DLM, and DSM as a regular cluster node.
- FLR cache flush requirement — see §47.2.5.2. On FLR the device must flush all
CXL.cache dirty lines to host memory. Host waits for FLR completion before
accessing the shared region.
##### 47.12.5.2 Type 2: Coherent Compute + Device Memory (CXL.cache + CXL.mem)
Type 2 is the richest CXL peer type. The device has both a coherent cache
(CXL.cache) AND device-local DRAM accessible to the host via load/store
(CXL.mem).
ISLE operating model:
- Mode B peer with bidirectional zero-copy — coherent in both directions:
device cache participates in host coherency domain (CXL.cache); host CPU can
directly load/store device DRAM as a NUMA memory tier (CXL.mem). Neither
direction requires DMA or explicit transfer.
- Memory tier AND compute peer — the device's DRAM is registered as
NumaNodeType::CxlMemory (or CxlSharedPool if multi-host). The device's
cores run ISLE and execute workloads. Same ClusterNode participates in both
memory placement decisions and compute scheduling.
- Ring buffers — can live in either host DRAM (coherent via CXL.cache) or
device DRAM (coherent via CXL.mem + host load/store). Placement policy: ring
buffers go in the lower-latency region as measured at runtime.
- FLR cache flush — same requirement as Type 1.
- Examples: future CXL-attached GPUs with HBM, CXL AI accelerators.
##### 47.12.5.3 Type 3: Memory Expander (CXL.mem only, minimal compute)
Type 3 devices provide DRAM accessible to the host via CXL.mem. They have no
accelerator compute from CXL's perspective, though they may have a small management
processor (typically ARM Cortex-A5x or similar).
ISLE operating model — memory-manager peer, not compute peer:
- The management processor runs a minimal ISLE instance (Path A via AArch64 build target) with a single responsibility: managing the DRAM pool.
- What it does: monitors pool health, handles tiering decisions (hot/cold page migration across CXL memory sub-regions), manages optional compression or encryption, retires bad pages, reports ECC errors and media errors to the host cluster membership layer.
- What it does NOT do: run application workloads, participate in DSM ownership transfers (the pool appears as a NUMA node to the host, not as a peer's memory), or require the full cluster protocol. It uses a lightweight subset: heartbeat + health reporting + pool management messages.
- No DLM, no DSM page protocol: the Type 3 peer does not own pages in the distributed sense. The host NUMA allocator owns pages in the CXL pool; the management processor just monitors and maintains the physical medium.
CXL 3.0 multi-host pool: when a CXL switch connects multiple hosts to the same Type 3 pool, the management processor becomes a shared pool coordinator:
- Arbitrates allocation between hosts (each host has a capacity reservation)
- Reports pool-wide health events to all connected hosts
- Does not arbitrate cache coherence (hardware does that via CXL.cache back-invalidate)
- Does not run DLM for cross-host locking (ISLE DLM handles that over CXL.cache)
Crash recovery: see §47.2.5.5 CXL Type 3 section. DRAM persists; management
layer is lost until management processor recovers. Pool transitions to
ManagementFaulted state; uncompressed/unencrypted pages remain accessible.
##### 47.12.5.4 CXL Switch as Fabric Manager Node
CXL 3.0 introduces intelligent CXL switches that route traffic between multiple hosts and multiple memory/compute devices. A CXL switch with embedded compute (ARM or RISC-V management processor) can run ISLE as a fabric manager node:
- Topology discovery: the switch sees all CXL endpoints and can report the full fabric topology (which hosts can reach which memory pools, with latency measurements) to the ISLE cluster membership layer.
- Routing policy: the switch can enforce traffic shaping, QoS, or bandwidth partitioning between hosts sharing the same CXL fabric.
- No data plane participation: the switch does not own memory pages, run workloads, or participate in DLM. It is a topology and routing oracle.
- This is a future capability: CXL 3.0 switch hardware with embedded compute is not yet commercially available (2025). The architecture is ready; the FabricManager node type and associated cluster messages are deferred to Phase 5+ (§47.16).
##### 47.12.5.5 Summary Table
| CXL Type | Compute | Memory | ISLE Peer Role | Transport | Crash Model |
|---|---|---|---|---|---|
| Type 1 | Yes (CXL.cache) | No | Full compute peer | Mode B | FLR flush + standard recovery |
| Type 2 | Yes (CXL.cache) | Yes (CXL.mem) | Compute + memory tier peer | Mode B | FLR flush + standard recovery |
| Type 3 | No (mgmt only) | Yes (CXL.mem) | Memory-manager peer | Heartbeat + pool msgs | ManagementFaulted; DRAM persists |
| CXL Switch | Yes (mgmt only) | No | Fabric manager (future) | Topology msgs | Fallback to static topology |
### 47.13 Linux Compatibility and MPI Integration
#### 47.13.1 Existing RDMA Applications (Unchanged)
All existing Linux RDMA applications work through the compatibility layer:
| Application | Linux Interface | ISLE Path |
|---|---|---|
| MPI (OpenMPI, MPICH) | libibverbs / libfabric | isle-compat RDMA compat layer |
| NCCL (multi-node GPU) | libibverbs + GDR | RDMA compat + GPUDirect RDMA |
| DPDK | ibverbs / EFA | RDMA compat |
| Ceph (msgr2) | RDMA transport | RDMA compat |
| Spark (RDMA shuffle) | libfabric | RDMA compat |
| Redis (RDMA) | ibverbs | RDMA compat |
The compatibility layer (isle-compat/src/rdma/) translates Linux verbs API calls
to KABI RdmaDeviceVTable calls. Binary libibverbs.so works without recompilation.
#### 47.13.2 MPI Optimization Opportunities
MPI implementations can opt into ISLE-specific features for better performance:
Standard MPI on ISLE (no changes, works today):
MPI_Send/MPI_Recv → libibverbs → RDMA NIC
MPI_Win_create (shared memory window) → mmap + RDMA
Performance: same as on Linux
MPI on ISLE with DSM (opt-in, future):
MPI_Win_create → ISLE_SHM_MAKE_DISTRIBUTED
MPI_Put/MPI_Get → direct load/store on DSM region
Kernel handles page migration transparently.
No explicit RDMA operations needed by MPI implementation.
Benefit: MPI one-sided operations become load/store.
Latency: the first access to each page costs a DSM fault (~3-5 μs, comparable
to the ~2-5 μs overhead of an explicit RDMA verb), but subsequent accesses
cost ~50-150 ns because the page is now local.
For iterative algorithms (most HPC): the working set becomes local after the
first iteration, so subsequent iterations run at local memory speed.
#### 47.13.3 Kubernetes / Container Integration
/sys/fs/cgroup/<group>/cluster.nodes
# Which nodes this cgroup's processes can run on
# "0 1 2 3" or "all"
/sys/fs/cgroup/<group>/cluster.memory.remote.max
# Maximum remote memory for this cgroup
/sys/fs/cgroup/<group>/cluster.accel.devices
# Allowed accelerators (including remote)
# "node0:gpu0 node0:gpu1 node1:gpu0"
Kubernetes integration:
- kubelet reads cluster topology from /sys/kernel/isle/cluster/
- Device plugin exposes remote GPUs as schedulable resources
- Pod spec: resources.limits: { isle.dev/gpu: 4, isle.dev/remote-gpu: 4 }
- kubelet sets cgroup constraints; kernel handles placement
#### 47.13.4 ISLE-Specific Cluster Interfaces
/sys/kernel/isle/cluster/
nodes # List of cluster nodes with state
topology # Cluster distance matrix
dsm/
regions # Active DSM regions
stats # DSM page fault / migration stats
memory_pool/
total # Cluster-wide memory total
available # Cluster-wide memory available
per_node/ # Per-node breakdown
scheduler/
balance_interval_ms # Global load balance interval
migrations # Process migration count
migration_log # Recent migrations (for debugging)
capabilities/
revocation_list # Current revocation list
key_ring # Cluster node public keys
### 47.14 Integration with ISLE Architecture
#### 47.14.1 Memory Manager Integration
The distributed memory features integrate with the existing MM at two points:
1. Page fault handler (extend existing):
Current: fault → check VMA → allocate page / CoW / swap-in
Extended: fault → check VMA → check PageLocationTracker:
- CpuNode → standard local fault (unchanged)
- DeviceLocal → device fault (Section 42, unchanged)
- RemoteNode → RDMA fetch from remote node (NEW)
- CxlPool → CXL load (hardware handles it) (NEW)
- NotPresent → allocate locally (unchanged)
2. Page reclaim / eviction (extend existing):
Current: LRU scan → compress or swap
Extended: LRU scan → compress, OR migrate to remote node, OR swap
- If remote memory is available and faster than swap: migrate
- Decision based on cluster distance matrix + pool availability
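The reclaim decision in point 2 can be sketched as a latency comparison over the candidates. A minimal sketch; the type names and the compress-first policy shown are illustrative, not the kernel's actual reclaim code:

```rust
#[derive(Debug, PartialEq)]
pub enum EvictionTarget {
    Compress,        // keep locally in compressed tier
    RemoteNode(u32), // migrate to a remote pool node
    Swap,            // fall back to local swap
}

/// Choose where an evicted cold page goes.
/// `remote` is the best pool candidate as (node_id, measured latency in ns)
/// from the cluster distance matrix, or None if no node has spare capacity.
pub fn eviction_target(
    remote: Option<(u32, u64)>,
    swap_latency_ns: u64,
    compressible: bool,
) -> EvictionTarget {
    if compressible {
        // Compressed tier stays local and is cheapest to restore.
        return EvictionTarget::Compress;
    }
    match remote {
        // Remote memory wins only when it actually beats the swap path.
        Some((node, lat)) if lat < swap_latency_ns => EvictionTarget::RemoteNode(node),
        _ => EvictionTarget::Swap,
    }
}
```

With the numbers from the tier table (~3 μs RDMA vs ~12 μs NVMe swap), remote memory wins whenever a pool node has capacity; the comparison flips automatically on slow fabrics or a saturated pool.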
#### 47.14.2 Device Registry Integration
New device types in the registry:
ClusterFabric (virtual root for cluster topology)
+-- rdma_link_0 (Node 0 ↔ Node 1, 200 Gb/s, 2.5 μs RTT)
+-- rdma_link_1 (Node 0 ↔ Node 2, 200 Gb/s, 4.0 μs RTT)
+-- cxl_link_0 (Node 0 ↔ CXL Pool 0, 64 GB/s, 300 ns)
RemoteNode (virtual device representing a remote machine)
+-- Properties:
| node-id: 1
| state: "active"
| rtt-ns: 2500
| bandwidth-gbit-s: 200
| memory-total: 549755813888
| memory-available: 137438953472
| gpu-count: 4
+-- Services published:
"remote-memory" (GlobalMemoryPool)
"remote-accel" (RemoteDeviceProxy for each GPU)
"remote-block" (RemoteDeviceProxy for each NVMe)
#### 47.14.3 FMA Integration
New FMA health events for distributed subsystem (Section 39):
| Rule | Threshold | Action |
|---|---|---|
| RDMA link degraded | >10% packet retransmits / minute | Alert + reduce traffic |
| RDMA link down | Link-down event | Failover to TCP or isolate node |
| Remote node unresponsive | 3 missed heartbeats (300ms) | Mark Suspect |
| Remote node dead | 10 missed heartbeats (1000ms) | Mark Dead + reclaim |
| DSM page loss | Exclusive page on dead node | Alert + SIGBUS to process |
| Cluster split-brain | Membership views diverge | Quorum protocol (§47.11.3) |
| CXL memory error | Uncorrectable ECC on CXL pool | Migrate pages + Alert |
| Clock skew detected | >10ms drift between nodes | Alert (affects capability expiry) |
#### 47.14.4 Stable Tracepoints
New stable tracepoints for distributed observability (Section 40):
| Tracepoint | Arguments | Description |
|---|---|---|
| `isle_tp_stable_dsm_fault` | node_id, remote_node, vaddr, latency_ns | DSM page fault |
| `isle_tp_stable_dsm_migrate` | src_node, dst_node, pages, bytes | DSM page migration |
| `isle_tp_stable_dsm_invalidate` | owner_node, reader_nodes, vaddr | DSM cache invalidation |
| `isle_tp_stable_cluster_join` | node_id, rdma_gid | Node joined cluster |
| `isle_tp_stable_cluster_leave` | node_id, reason | Node left/failed |
| `isle_tp_stable_cluster_migrate` | pid, src_node, dst_node, reason | Process migration |
| `isle_tp_stable_rdma_transfer` | src_node, dst_node, bytes, latency_ns | RDMA data transfer |
| `isle_tp_stable_remote_fault` | node_id, tier, vaddr, latency_ns | Remote memory access |
| `isle_tp_stable_global_pool_alloc` | node_id, remote_node, bytes | Global pool allocation |
| `isle_tp_stable_global_pool_reclaim` | node_id, bytes, reason | Global pool reclaim |
#### 47.14.5 Object Namespace
Cluster objects in the unified namespace (Section 41):
\Cluster\
+-- Nodes\
| +-- node0\ (this machine)
| | +-- State "active"
| | +-- Memory "512 GB total, 384 GB free, 32 GB exported"
| | +-- GPUs → symlink to \Accelerators\
| +-- node1\
| +-- State "active"
| +-- Memory "512 GB total, 256 GB free, 64 GB exported"
| +-- RTT "2500 ns"
| +-- Bandwidth "200 Gb/s"
+-- DSM\
| +-- region_0\
| +-- Size "1073741824 (1 GB)"
| +-- Pages "262144 total, 131072 local, 131072 remote"
| +-- Faults "12345 total, 3.2 μs avg"
+-- MemoryPool\
| +-- Total "4096 GB (8 nodes × 512 GB)"
| +-- Available "2048 GB"
| +-- LocalExported "128 GB"
| +-- RemoteUsed "64 GB"
+-- Fabric\
+-- Links (RDMA link table with latency/bandwidth)
+-- Topology (switch-level fabric map)
Browsable via islefs:
cat /mnt/isle/Cluster/Nodes/node1/RTT
→ 2500 ns
ls /mnt/isle/Cluster/DSM/
→ region_0 region_1 region_2
47.14.6 Open Questions
The following items require further design work:
- DSM replication protocol: Full design for replication_factor > 1, including write propagation, quorum reads, and consistency guarantees.
- TCP fallback mechanism: When RDMA fails (NIC down, driver crash), transparently fall back to TCP sockets for kernel transport. Design must handle in-flight RDMA operations and re-establish connections.
- Scale extension beyond 64 nodes: Data structure changes for >64-node clusters (extended bitfields, hierarchical directories).
- io_uring migration: How to migrate io_uring instances during process migration (in-flight SQEs, registered buffers, registered files).
- TCP socket migration: How to migrate established TCP connections (sequence numbers, window state, congestion state). May require connection proxy on source node.
- Partial network partitions: Non-transitive reachability (A can reach B, B can reach C, but A cannot reach C). Current quorum model assumes transitive reachability.
47.15 Implementation Phasing
| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| KernelTransport (RDMA kernel API) | Phase 3 | RDMA driver KABI | Foundation for everything |
| KernelTransport (TCP fallback) | Phase 3+ | KernelTransport | TCP socket transport for RDMA-less environments; latency ~50-200μs vs ~2-3μs RDMA |
| Cluster join / topology discovery | Phase 3 | KernelTransport | Basic cluster formation |
| Heartbeat + failure detection | Phase 3 | KernelTransport | Must have before distributed state |
| Distributed Lock Manager (§31a) | Phase 3-4 | KernelTransport, Heartbeat | RDMA-native DLM; prerequisite for clustered FS |
| Distributed IPC (RDMA rings) | Phase 3-4 | KernelTransport, IPC | Natural extension of existing IPC |
| Pre-registered kernel memory | Phase 3-4 | RDMA driver, IOMMU | Performance prerequisite |
| PageLocation RemoteNode variant | Phase 4 | Memory manager | Small MM extension |
| DSM page fault handler | Phase 4 | PageLocation, KernelTransport | Core DSM functionality |
| DSM directory (home-node hash) | Phase 4 | DSM fault handler | Page ownership tracking |
| DSM coherence protocol | Phase 4-5 | DSM directory | Multiple-reader / single-writer |
| Distributed capabilities (signed) | Phase 4 | Capability system, Ed25519 | Security foundation |
| Cooperative page cache | Phase 4-5 | DSM, VFS | Distributed page cache |
| Global memory pool (basic) | Phase 5 | DSM, cgroups | Remote memory as swap tier |
| Cluster scheduler | Phase 5 | Global pool, DSM affinity | Process migration |
| Process migration | Phase 5 | Cluster scheduler | Freeze/thaw + lazy page fetch |
| Remote device proxy | Phase 5 | KernelTransport, AccelBase | Remote GPU/NVMe access |
| GPUDirect RDMA cross-node | Phase 5 | P2P DMA, RDMA | GPU↔GPU across network |
| Split-brain resolution | Phase 5 | Heartbeat, DSM | Quorum + fencing |
| CXL 2.0 pooled memory | Phase 5 | Memory manager | CXL memory as NUMA node |
| CXL 3.0 shared memory | Phase 5+ | CXL 2.0, DSM | Hardware-coherent DSM |
| Global memory pool (advanced) | Phase 5+ | All of above | Full cluster memory management |
| Cluster-wide cgroup integration | Phase 5+ | Cluster scheduler, global pool | Kubernetes-ready |
| DSM replication (fault tolerance) | Phase 5+ | DSM, replication protocol | For critical workloads |
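The "DSM directory (home-node hash)" row describes page-ownership tracking via hashing. A minimal sketch of the idea, with an illustrative hash (FNV-1a) and modulo placement that are assumptions rather than the ISLE protocol:

```rust
// Sketch of home-node selection for the DSM directory: the directory entry
// for each 4 KB page lives on a node chosen by hashing the page number.
// FNV-1a and modulo placement are illustrative assumptions.

const PAGE_SHIFT: u64 = 12; // 4 KB pages

fn fnv1a(mut x: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for _ in 0..8 {
        h ^= x & 0xff;
        h = h.wrapping_mul(0x100_0000_01b3);
        x >>= 8;
    }
    h
}

/// Node holding the directory entry for the page containing `vaddr`.
pub fn home_node(vaddr: u64, num_nodes: u64) -> u64 {
    fnv1a(vaddr >> PAGE_SHIFT) % num_nodes
}

fn main() {
    // All addresses within one 4 KB page resolve to the same home node.
    assert_eq!(home_node(0x1000, 8), home_node(0x1fff, 8));
    assert!(home_node(0x1000, 8) < 8);
}
```

Plain modulo placement reshuffles most entries when the node count changes; a production directory would likely use consistent hashing to bound rebalancing on membership change.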
Priority Rationale
Phase 3-4 (Foundation): RDMA transport + cluster formation + basic DSM. This makes ISLE cluster-aware and enables distributed IPC. MPI and NCCL workloads benefit immediately from kernel-native RDMA transport.
Phase 4-5 (Practical Wins): Cooperative page cache + global memory pool + signed capabilities. This is when distributed ISLE becomes genuinely useful: remote memory as a tier, shared file caching, and secure cross-node operations.
Phase 5+ (Competitive Advantage): Process migration, CXL integration, cluster-wide resource management. Features that no other OS provides. The kernel manages a cluster of machines as a single coherent system.
47.16 Licensing Summary
| Component | IP Source | Risk |
|---|---|---|
| RDMA kernel transport | Original design (uses standard RDMA verbs) | None |
| DSM page coherence | Academic (published research: Ivy, TreadMarks, Munin, GAM) | None |
| Home-node directory | Academic (distributed hash table, published) | None |
| Global memory pool | Original design (extends NUMA model) | None |
| Cooperative page cache | Academic (published research) | None |
| Cluster scheduler | Original design (extends CBS) | None |
| Distributed capabilities | Original design (Ed25519 is public-domain) | None |
| CXL integration | CXL spec (public, royalty-free consortium) | None |
| Process migration | Academic (MOSIX concepts are published research) | None |
| Split-brain / quorum | Academic (Paxos, Raft; published) | None |
| CRDT revocation list | Academic (Shapiro et al., published) | None |
All components are either original design or based on published academic research and open specifications. No vendor-proprietary APIs or patented algorithms.
47.17 Comparison: Why Previous DSM Projects Failed and Why This Succeeds
| Factor | Kerrighed / OpenSSI / MOSIX | ISLE Distributed |
|---|---|---|
| Kernel design | Bolted onto Linux (30M+ LOC, assumes single machine) | Designed from scratch with distribution in mind |
| Coherence granularity | Cache-line (64B) — false sharing kills performance | Page-level (4KB) — matches network latency |
| Hardware support | None (pure software coherence) | RDMA (2020s), CXL 3.0 (2025+) |
| Memory model | Patched Linux MM (invasive, broke on updates) | PageLocationTracker already supports heterogeneous tiers |
| Transport | TCP sockets (high overhead) | RDMA one-sided ops (zero remote CPU) |
| Security | Unix permissions (not network-portable) | Cryptographically-signed capabilities |
| Fault tolerance | Fragile (node failure = cluster crash) | Quorum-based, bounded-lifetime capabilities, graceful degradation |
| Application compat | Modified syscall layer, broke things | Standard POSIX + opt-in extensions |
| Maintenance burden | Thousands of patches across all subsystems | Clean integration points (PageLocation, IPC transport, KABI proxy) |
| Timing | 2000s — hardware wasn't ready | 2026 — RDMA ubiquitous, CXL arriving, AI demands it |
48. SmartNIC and DPU Integration
48.1 Problem
DPUs (Data Processing Units) — NVIDIA BlueField, AMD Pensando, Intel IPU — are processors that sit on the network path. They have their own ARM cores, run their own OS, and can process network traffic, storage I/O, and security policies without using host CPU cycles.
The kernel needs a model for "this driver runs on the DPU, not on the host."
48.2 Design: Offload Tier
Extend the three-tier driver model (Section 6) with a concept of execution location:
// isle-core/src/driver/offload.rs
/// Where a driver's code executes.
/// `#[repr(C, u32)]` is used (rather than `#[repr(u32)]`) because the
/// `Offloaded` variant carries data fields: the `C, u32` form gives the
/// C-compatible tagged-struct layout, whereas plain `#[repr(u32)]` lays the
/// tag out inside each variant's union arm (RFC 2195), which is harder for
/// C consumers of the KABI.
#[repr(C, u32)]
pub enum DriverExecLocation {
/// Driver runs on host CPU (standard: Tier 0, 1, or 2).
Host = 0,
/// Driver runs on a DPU/SmartNIC.
/// Kernel communicates with it via PCIe mailbox or shared memory.
/// Same KABI vtable interface, different transport.
Offloaded {
/// Device ID of the DPU in the device registry.
dpu_device: DeviceNodeId,
/// Communication mechanism.
transport: OffloadTransport,
},
}
#[repr(u32)]
pub enum OffloadTransport {
/// PCIe mailbox (register-based, low-latency, small messages).
PcieMailbox = 0,
/// Shared memory region (mapped via PCIe BAR, bulk data).
SharedMemory = 1,
/// RDMA (if DPU has RDMA capability).
Rdma = 2,
}
48.3 How It Works
Standard driver (host):
Process → syscall → ISLE Core → KABI vtable call → Driver (host CPU)
Offloaded driver (DPU):
Process → syscall → ISLE Core → KABI vtable call →
→ OffloadProxy → PCIe mailbox → DPU driver → Hardware
The OffloadProxy implements the same KABI vtable as the real driver.
It translates vtable calls into DPU mailbox messages.
ISLE Core cannot tell the difference.
This is the same proxy pattern used for remote devices in the distributed kernel (Section 47.10). A DPU is essentially a very close "remote node" connected via PCIe instead of RDMA.
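The proxy pattern above can be sketched with a trait standing in for the KABI vtable. The trait, type names, and mailbox behavior here are illustrative assumptions; the point is only that ISLE Core calls through the same interface regardless of where the driver executes:

```rust
// Sketch of the OffloadProxy pattern: host driver and proxy implement the
// same "vtable" (a trait here), so the core cannot tell them apart.

trait NetDriverVtable {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32>;
}

struct HostNicDriver;                  // runs on the host CPU
struct OffloadProxy { pub sent: usize } // would forward over a PCIe mailbox

impl NetDriverVtable for HostNicDriver {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32> {
        Ok(frame.len()) // would program the NIC directly
    }
}

impl NetDriverVtable for OffloadProxy {
    fn send(&mut self, frame: &[u8]) -> Result<usize, i32> {
        // Would serialize the call into a mailbox message and ring the DPU;
        // here we just count forwarded calls.
        self.sent += 1;
        Ok(frame.len())
    }
}

// ISLE Core sees only the trait object, never the execution location.
fn core_send(drv: &mut dyn NetDriverVtable, frame: &[u8]) -> Result<usize, i32> {
    drv.send(frame)
}

fn main() {
    let mut host = HostNicDriver;
    let mut proxy = OffloadProxy { sent: 0 };
    assert_eq!(core_send(&mut host, &[0u8; 64]), Ok(64));
    assert_eq!(core_send(&mut proxy, &[0u8; 64]), Ok(64));
    assert_eq!(proxy.sent, 1);
}
```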
48.4 Use Cases
| Scenario | Host Path | DPU Path | Benefit |
|---|---|---|---|
| Network firewall | CPU processes every packet | DPU processes packets, host sees only allowed traffic | CPU freed for applications |
| NVMe-oF target | CPU handles RDMA + NVMe | DPU handles RDMA + NVMe, host CPU uninvolved | Zero host CPU for storage serving |
| IPsec / TLS | CPU encrypts/decrypts | DPU encrypts/decrypts | CPU freed, lower latency |
| vSwitch (OVS) | CPU handles VM networking | DPU handles VM networking | Major CPU savings in cloud |
| Telemetry | CPU collects and sends metrics | DPU collects and sends | No host CPU overhead |
48.5 DPU Discovery and Boot
DPU discovery is via standard PCIe enumeration (the DPU appears as a PCIe device). Firmware loading is vendor-specific: BlueField uses UEFI on the DPU's ARM cores, Intel IPU uses its own loader. The host kernel's role:
1. Enumerate the DPU as a PCIe device (standard PCI scan at boot).
2. Establish a communication channel (PCIe mailbox or shared memory BAR).
3. KABI vtable exchange: host and DPU exchange offload capabilities.
4. DPU firmware loading is out of scope; it is managed by the DPU's own boot chain.
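Step 3 of the discovery sequence (capability exchange) can be sketched as a negotiation over descriptors traded through the mailbox. The struct layout and capability bits below are assumptions, not the ISLE wire format:

```rust
// Illustrative KABI capability exchange: host and DPU each advertise a
// descriptor; offload proceeds only for capabilities both sides support
// and only when the KABI versions match.

#[repr(C)]
#[derive(Clone, Copy)]
pub struct OffloadCaps {
    pub kabi_version: u32,
    pub cap_bits: u64, // hypothetical: bit 0 = firewall, bit 1 = NVMe-oF, ...
}

/// Returns the agreed capability set, or None if the KABI versions differ
/// (in which case all functions stay on host-side drivers).
pub fn negotiate(host: OffloadCaps, dpu: OffloadCaps) -> Option<u64> {
    if host.kabi_version != dpu.kabi_version {
        return None;
    }
    Some(host.cap_bits & dpu.cap_bits)
}

fn main() {
    let host = OffloadCaps { kabi_version: 3, cap_bits: 0b0111 };
    let dpu  = OffloadCaps { kabi_version: 3, cap_bits: 0b0101 };
    assert_eq!(negotiate(host, dpu), Some(0b0101));
}
```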
48.6 DPU Failure Handling
DPU crash / reboot / PCIe link failure:
1. PCIe error detected (AER or link-down notification).
2. OffloadProxy marks all pending I/O as failed:
- Pending network I/O: -EIO to waiting processes.
- Pending storage I/O: -EIO, filesystem handles error.
- Pending control operations: -ENODEV.
3. Notify userspace via udev/netlink: DPU offline event.
4. Attempt DPU recovery:
a. PCIe hot-reset (if supported by hardware).
b. Wait for DPU firmware to re-initialize.
c. Re-establish mailbox communication.
d. Re-exchange KABI vtables.
5. If recovery succeeds: resume offloaded operations.
6. If recovery fails: fall back to host-side driver
(if one exists for the offloaded function).
Example: DPU was offloading NIC → fall back to host NIC driver.
Not all functions have host fallbacks — admin notified.
48.7 Shared State Consistency
DPU and host may share data-plane state (flow tables, packet counters) via PCIe shared memory:
Source of truth:
Data plane (flow entries, counters): DPU is authoritative.
DPU processes packets at line rate. Host reads are stale by ~μs.
Control plane (policy, configuration): Host is authoritative.
Host writes policy. DPU reads and applies.
Consistency model:
Shared memory uses producer-consumer pattern with sequence counters.
No locks across PCIe (latency too high for spinlocks).
Host writes are fenced (SFENCE) before signaling DPU via mailbox.
DPU writes are fenced before updating sequence counter.
Host reads check sequence counter for torn-write detection.
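The sequence-counter scheme above is essentially a seqlock. A single-file sketch, with a plain struct standing in for the PCIe BAR mapping and atomic orderings standing in for the explicit fences (names are illustrative):

```rust
// Seqlock-style producer-consumer sketch for DPU-owned counters read by the
// host over shared memory. The DPU is the sole writer; the host retries any
// read that straddles a write (torn-write detection via the sequence counter).

use std::sync::atomic::{AtomicU64, Ordering};

struct SharedCounters {
    seq: AtomicU64,     // even = stable, odd = write in progress
    packets: AtomicU64,
    bytes: AtomicU64,
}

impl SharedCounters {
    // DPU side: make seq odd, update payload, make seq even again.
    fn write(&self, packets: u64, bytes: u64) {
        self.seq.fetch_add(1, Ordering::Release); // now odd
        self.packets.store(packets, Ordering::Relaxed);
        self.bytes.store(bytes, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // now even
    }

    // Host side: retry until a consistent (untorn) snapshot is observed.
    fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 != 0 {
                continue; // write in progress
            }
            let p = self.packets.load(Ordering::Relaxed);
            let b = self.bytes.load(Ordering::Relaxed);
            if self.seq.load(Ordering::Acquire) == s1 {
                return (p, b);
            }
        }
    }
}

fn main() {
    let c = SharedCounters {
        seq: AtomicU64::new(0),
        packets: AtomicU64::new(0),
        bytes: AtomicU64::new(0),
    };
    c.write(100, 64_000);
    assert_eq!(c.read(), (100, 64_000));
}
```

Over a real PCIe BAR the Relaxed stores would be paired with explicit SFENCE/mailbox signaling as described above; the retry loop is what makes cross-PCIe reads safe without any shared lock.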
DPU Firmware Update Lifecycle:
Updating DPU firmware while services are running:
- Migrate all offloaded functions back to host-side drivers (if host fallbacks exist).
- Quiesce the DPU: drain in-flight I/O, flush shared-memory state.
- Apply firmware update via vendor-specific mechanism (BlueField:
bfb-install, Intel IPU: vendor tool). - DPU reboots with new firmware.
- Re-establish KABI vtable exchange (new firmware may have new capabilities).
- Re-offload functions that were migrated to host.
If no host fallback exists for an offloaded function, the update requires a maintenance window (the function is unavailable during DPU reboot).
DPU Multi-Tenancy:
In cloud environments, a single DPU serves multiple VMs/containers:
- SR-IOV VFs: The DPU's NIC exposes SR-IOV Virtual Functions, one per tenant. Each VF is passed through to a VM via IOMMU. Hardware isolates VF traffic.
- Hardware flow classification: The DPU's embedded switch classifies packets by flow (5-tuple or VXLAN VNI) and routes to the correct VF.
- Per-VF offload: Each tenant's offloaded functions (firewall rules, encryption keys, QoS policies) are isolated per VF. The host kernel's cgroup hierarchy maps to VF assignments.
Offload Decision Criteria:
The kernel decides whether to offload a function to the DPU based on:
- Capability: Does the DPU support this function? (Checked via KABI vtable exchange.)
- Admin policy: /sys/kernel/isle/dpu/<id>/offload_policy (auto, manual, disabled).
- Intent integration: If the cgroup's intent.efficiency is high, prefer DPU offload (reduces host CPU power). If intent.latency_ns is very low, check whether the PCIe mailbox round-trip (~1-2 μs) exceeds the latency budget.
- Load: If the DPU's embedded cores are saturated (utilization > 90%), do not offload additional functions.
48.8 Performance Impact
When using DPU offload: the host CPU does LESS work. Performance improves for host applications because infrastructure processing moves to the DPU.
Overhead: one PCIe mailbox round-trip (~1-2 μs) per KABI vtable call that crosses the host-DPU boundary. But DPU-offloaded drivers handle the fast path entirely on the DPU — the host only sees control-plane operations (setup, teardown, configuration changes).