
Sections 27–33 of the ISLE Architecture. For the full table of contents, see README.md.


Part VII: Storage and Filesystems

Durability, ZFS, volume management, block storage networking, clustered filesystems, persistent memory, and computational storage.


27. Durability Guarantees

Linux problem: Applications couldn't reliably know when data was on disk. The ext4 delayed-allocation data loss bugs (2008-2009) were a symptom. Worse, fsync() error reporting was broken — errors could be silently lost between calls. Partially fixed with errseq_t in kernel 4.13 (with subsequent refinements in 4.14 and 4.16), but the contract between applications and filesystems around durability remains murky.

ISLE design:

  • Error reporting: Every filesystem operation tracks errors via a per-file error sequence counter. fsync() returns errors exactly once and never silently drops them. The VFS layer enforces this — individual filesystem implementations cannot bypass it.

  • Durability contract: Three explicit levels, documented and testable:
    1. write() → data in page cache (may be lost on crash)
    2. fsync() → data + metadata on stable storage (guaranteed)
    3. O_SYNC / O_DSYNC → each write waits for stable storage

  • Filesystem crash consistency: All filesystem implementations must declare their consistency model (journal, COW, log-structured) and pass a crash-consistency test suite as part of KABI certification.

  • Error propagation: Writeback errors propagate to ALL file descriptors that have the file open, not just the one that triggered writeback. No silent data loss.
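
The exactly-once error-reporting contract can be sketched as a per-file error sequence counter, modeled on Linux's errseq_t. This is an illustrative sketch, not the ISLE implementation — the bit layout, type names, and methods are assumptions:

```rust
// Sketch (not the ISLE API): a per-file writeback error sequence counter,
// modeled on Linux's errseq_t. The low bits store the most recent errno;
// the counter in the high bits advances only when a new error arrives
// after the previous one was sampled, so every fsync() caller observes
// each error exactly once and never loses one.
use std::sync::atomic::{AtomicU64, Ordering};

const ERR_MASK: u64 = 0xFFF;   // low 12 bits: errno of the last error
const SEEN: u64 = 1 << 12;     // set once any sampler has consumed it
const CTR_STEP: u64 = 1 << 13; // counter occupies the remaining bits

pub struct ErrSeq(AtomicU64);

impl ErrSeq {
    pub const fn new() -> Self { ErrSeq(AtomicU64::new(0)) }

    /// Record a writeback error (e.g. EIO) against the file.
    pub fn set(&self, errno: u64) {
        let old = self.0.load(Ordering::Acquire);
        // Bump the counter only if the previous error was already seen;
        // otherwise overwriting the error code is enough.
        let ctr = if old & SEEN != 0 { (old & !(ERR_MASK | SEEN)) + CTR_STEP }
                  else { old & !(ERR_MASK | SEEN) };
        self.0.store(ctr | (errno & ERR_MASK), Ordering::Release);
    }

    /// fsync() path: report an error iff one occurred since `since`, and
    /// return the new cursor the caller stores in its file descriptor.
    pub fn check_and_advance(&self, since: u64) -> (Option<u64>, u64) {
        let cur = self.0.fetch_or(SEEN, Ordering::AcqRel) | SEEN;
        if cur == since || cur & ERR_MASK == 0 {
            (None, cur)                 // nothing new: fsync succeeds
        } else {
            (Some(cur & ERR_MASK), cur) // report once, advance cursor
        }
    }
}

fn main() {
    let es = ErrSeq::new();
    let (err, cursor) = es.check_and_advance(0);
    assert_eq!(err, None);              // no error yet
    es.set(5);                          // EIO during background writeback
    let (err, cursor) = es.check_and_advance(cursor);
    assert_eq!(err, Some(5));           // reported exactly once...
    let (err, _) = es.check_and_advance(cursor);
    assert_eq!(err, None);              // ...and never silently re-raised
}
```

Because every file descriptor holds its own cursor, the same sequence also gives the "propagate to ALL open descriptors" behavior: each descriptor that has not yet sampled the error still sees it.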


27a. Virtual Filesystem Layer

The VFS (isle-vfs) provides a unified interface over all filesystem types. It is a Tier 1 component running in its own hardware isolation domain (see Section 3 for platform-specific isolation mechanisms), isolated from both isle-core and the individual filesystem drivers it manages.

Why VFS is Tier 1 (not Tier 0):

The VFS handles complex, security-sensitive operations: path resolution (symlink loops, mount point crossing), permission checks, and filesystem driver coordination. Isolating VFS from Core provides:

  1. Attack surface reduction: Path resolution bugs (symlink attacks, directory traversal) are confined to the VFS domain and cannot corrupt Core memory.

  2. Driver isolation chain: Core → VFS (Tier 1) → Filesystem driver (Tier 1/2). A compromised filesystem driver cannot corrupt VFS metadata, and a compromised VFS cannot corrupt Core memory.

  3. Crash containment: A VFS panic (e.g., corrupted dentry cache) is recoverable without rebooting the entire kernel.

Performance implications and mitigation:

The Core → VFS domain switch costs ~23 cycles for the bare WRPKRU instruction (x86-64 MPK). The full domain crossing — including argument marshaling via the inter-domain ring buffer and cache effects — is ~30-35 cycles per crossing. This overhead is amortized by:

  1. Page Cache in Core: The Page Cache (Section 12.3) lives in Core, not VFS. Cached file reads/writes hit the Page Cache directly with zero domain switches. Only cache misses (actual I/O) cross into VFS.

  2. Batching: Multiple file operations within a single syscall (e.g., readv, io_uring batches) amortize the domain switch over many operations.

  3. Dentry cache hit rate: The dentry cache (in VFS) has >99% hit rate for typical workloads. Path resolution is fast, and the domain switch cost is dominated by the actual I/O latency (microseconds vs nanoseconds).

Measured overhead: For a 4KB NVMe read (~10μs device latency), the additional domain switches (Core → VFS → FS driver) add ~70 cycles (~30ns total), which is 0.3% overhead. This is well within the "<5% overhead" target.

27a.1 VFS Architecture

Responsibilities: path resolution, dentry caching, inode management, mount tree traversal, and permission checks (delegated to isle-core's capability system via the inter-domain ring buffer).

Filesystem drivers register as VFS backends. The VFS never interprets on-disk format directly — it delegates all storage operations through three trait interfaces:

/// Filesystem-level operations (mount, unmount, statfs).
/// Implemented once per filesystem type (ext4, XFS, btrfs, ZFS, tmpfs, etc.).
pub trait FileSystemOps: Send + Sync {
    /// Mount a filesystem from the given source device with flags and options.
    fn mount(&self, source: &str, flags: MountFlags, data: &[u8]) -> Result<SuperBlock>;

    /// Unmount a previously mounted filesystem.
    fn unmount(&self, sb: &SuperBlock) -> Result<()>;

    /// Return filesystem statistics (total/free/available blocks and inodes).
    fn statfs(&self, sb: &SuperBlock) -> Result<StatFs>;

    /// Flush all dirty data and metadata for this filesystem to stable storage.
    /// Backend for syncfs(2) and the filesystem-level portion of sync(2).
    fn sync_fs(&self, sb: &SuperBlock, wait: bool) -> Result<()>;

    /// Remount with changed flags/options (e.g., `mount -o remount,ro`).
    fn remount(&self, sb: &SuperBlock, flags: MountFlags, data: &[u8]) -> Result<()>;

    /// Freeze the filesystem for a consistent snapshot. All pending writes are
    /// flushed and new writes block until thaw. Used by LVM snapshots, device-mapper,
    /// and backup tools via FIFREEZE ioctl.
    fn freeze(&self, sb: &SuperBlock) -> Result<()>;

    /// Thaw a previously frozen filesystem, allowing writes to resume.
    fn thaw(&self, sb: &SuperBlock) -> Result<()>;

    /// Format filesystem-specific mount options for /proc/mounts output.
    fn show_options(&self, sb: &SuperBlock, buf: &mut [u8]) -> Result<usize>;
}

/// Inode (directory structure) operations.
/// Handles namespace operations: lookup, create, link, unlink, rename.
///
/// Note: `OsStr` is a kernel-defined type (NOT `std::ffi::OsStr`, which is
/// unavailable in `no_std`). It is a dynamically-sized type (DST) wrapping
/// `[u8]`, representing filenames that may contain arbitrary non-UTF-8 bytes
/// (Linux filenames are byte strings, not Unicode). Defined in
/// `isle-vfs/src/types.rs`:
///   `pub struct OsStr([u8]);`
/// As a DST, `OsStr` cannot be used by value — it is always behind a
/// reference (`&OsStr`) or `Box<OsStr>`. `&OsStr` is a fat pointer
/// (pointer + length), analogous to `&[u8]` but carrying the semantic
/// intent of "filesystem name component." Conversion from `&str` is
/// infallible (UTF-8 is a valid byte sequence); conversion TO `&str`
/// returns `Result` (may fail on non-UTF-8 filenames).
pub trait InodeOps: Send + Sync {
    /// Look up a child entry by name within a parent directory.
    fn lookup(&self, parent: InodeId, name: &OsStr) -> Result<InodeId>;

    /// Create a regular file in the given directory.
    fn create(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a subdirectory.
    fn mkdir(&self, parent: InodeId, name: &OsStr, mode: FileMode) -> Result<InodeId>;

    /// Create a hard link: new entry `new_name` in `new_parent` pointing to `inode`.
    fn link(&self, inode: InodeId, new_parent: InodeId, new_name: &OsStr) -> Result<()>;

    /// Create a symbolic link containing `target` at `parent/name`.
    fn symlink(&self, parent: InodeId, name: &OsStr, target: &OsStr) -> Result<InodeId>;

    /// Read the target of a symbolic link.
    fn readlink(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Create a device special file (block/char device, FIFO, or socket).
    fn mknod(&self, parent: InodeId, name: &OsStr, mode: FileMode, dev: DevId) -> Result<InodeId>;

    /// Remove a directory entry (unlink for files, rmdir for empty directories).
    fn unlink(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Remove an empty directory. Separate from unlink for POSIX semantics:
    /// `unlink()` on a directory returns EISDIR; `rmdir()` on a file returns ENOTDIR.
    fn rmdir(&self, parent: InodeId, name: &OsStr) -> Result<()>;

    /// Rename/move a directory entry, possibly across directories.
    /// `flags` supports RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT
    /// (Linux renameat2 semantics, required for overlayfs).
    fn rename(
        &self,
        old_parent: InodeId, old_name: &OsStr,
        new_parent: InodeId, new_name: &OsStr,
        flags: RenameFlags,
    ) -> Result<()>;

    /// Get inode attributes (size, mode, timestamps, link count).
    fn getattr(&self, inode: InodeId) -> Result<InodeAttr>;

    /// Set inode attributes (chmod, chown, utimes).
    fn setattr(&self, inode: InodeId, attr: &SetAttr) -> Result<()>;

    /// List extended attributes on an inode.
    fn listxattr(&self, inode: InodeId, buf: &mut [u8]) -> Result<usize>;

    /// Get an extended attribute value.
    fn getxattr(&self, inode: InodeId, name: &OsStr, buf: &mut [u8]) -> Result<usize>;

    /// Set an extended attribute value.
    fn setxattr(&self, inode: InodeId, name: &OsStr, value: &[u8], flags: XattrFlags)
        -> Result<()>;

    /// Remove an extended attribute.
    fn removexattr(&self, inode: InodeId, name: &OsStr) -> Result<()>;
}

/// File data operations (open, read, write, sync, allocate, close).
pub trait FileOps: Send + Sync {
    /// Called when a file is opened. Allows the filesystem to initialize per-open
    /// state (NFS delegation, device state, lock state). Returns a filesystem-private
    /// context value stored in the file descriptor.
    fn open(&self, inode: InodeId, flags: OpenFlags) -> Result<u64>;

    /// Called when the last file descriptor referencing this open file is closed.
    /// Filesystem releases per-open state (flock release-on-close, NFS delegation
    /// return, device cleanup). `private` is the value returned by `open()`.
    fn release(&self, inode: InodeId, private: u64) -> Result<()>;

    /// Read data from a file at the given offset. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn read(&self, inode: InodeId, private: u64, offset: u64, buf: &mut [u8]) -> Result<usize>;

    /// Write data to a file at the given offset. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn write(&self, inode: InodeId, private: u64, offset: u64, buf: &[u8]) -> Result<usize>;

    /// Truncate a file to the specified size. This is separate from setattr
    /// because truncation is a complex operation on many filesystems: it must
    /// free blocks/extents, update extent trees, handle COW (ZFS/btrfs),
    /// interact with snapshots, and flush in-progress writes beyond the new
    /// size. The VFS calls truncate after updating the in-memory inode size.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn truncate(&self, inode: InodeId, private: u64, new_size: u64) -> Result<()>;

    /// Flush file data (and optionally metadata) to stable storage.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn fsync(&self, inode: InodeId, private: u64, datasync: bool) -> Result<()>;

    /// Pre-allocate or punch holes in file storage. `private` is the
    /// filesystem-private context value returned by `open()`.
    fn fallocate(&self, inode: InodeId, private: u64, offset: u64, len: u64, mode: FallocateMode) -> Result<()>;

    /// Read directory entries. Returns entries starting from `offset` (an opaque
    /// cookie, not a byte position). The callback is invoked for each entry; it
    /// returns `false` to stop iteration (buffer full). This is the backend for
    /// `getdents64(2)`. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn readdir(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        emit: &mut dyn FnMut(InodeId, u64, FileType, &OsStr) -> bool,
    ) -> Result<()>;

    /// Seek to a data or hole region (SEEK_DATA / SEEK_HOLE, lseek(2)).
    /// Filesystems that do not support sparse files return the file size for
    /// SEEK_DATA at any offset, and ENXIO for SEEK_HOLE at any offset.
    /// `private` is the filesystem-private context value returned by `open()`.
    fn llseek(&self, inode: InodeId, private: u64, offset: i64, whence: SeekWhence) -> Result<u64>;

    /// Map a file region into a process address space. The VFS calls this to
    /// obtain the page frame list; the actual page table manipulation is done
    /// by isle-core (Section 12). Filesystems that do not support mmap (e.g.,
    /// procfs, sysfs) return ENODEV. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn mmap(&self, inode: InodeId, private: u64, offset: u64, len: usize, prot: MmapProt) -> Result<MmapResult>;

    /// Handle a filesystem-specific ioctl. The VFS dispatches generic ioctls
    /// (FIOCLEX, FIONREAD, etc.) itself; only unrecognized ioctls reach the
    /// filesystem driver. Returns ENOTTY for unsupported ioctls. `private` is
    /// the filesystem-private context value returned by `open()`.
    fn ioctl(&self, inode: InodeId, private: u64, cmd: u32, arg: u64) -> Result<i64>;

    /// Splice data between a file and a pipe without copying through userspace.
    /// Backend for splice(2), sendfile(2), and copy_file_range(2). Filesystems
    /// that do not implement this get a generic page-cache-based fallback
    /// provided by the VFS. `private` is the filesystem-private context value
    /// returned by `open()`.
    fn splice_read(
        &self,
        inode: InodeId,
        private: u64,
        offset: u64,
        pipe: PipeId,
        len: usize,
    ) -> Result<usize>;

    /// Splice data from a pipe into a file without copying through userspace.
    /// Reverse direction of splice_read: pipe is the data source, file is the
    /// destination. Backend for splice(2) write direction and vmsplice(2).
    /// Filesystems that do not implement this get a generic page-cache-based
    /// fallback provided by the VFS. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn splice_write(
        &self,
        pipe: PipeId,
        inode: InodeId,
        private: u64,
        offset: u64,
        len: usize,
    ) -> Result<usize>;

    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR). Regular files
    /// always return ready; special files (pipes, device nodes, eventfd)
    /// implement blocking semantics. `private` is the filesystem-private
    /// context value returned by `open()`.
    fn poll(&self, inode: InodeId, private: u64, events: PollEvents) -> Result<PollEvents>;
}

/// Dentry (directory entry) lifecycle operations.
/// Most filesystems use the default VFS implementations. Only network and
/// clustered filesystems need custom implementations (primarily d_revalidate).
pub trait DentryOps: Send + Sync {
    /// Revalidate a cached dentry. Called before using a cached dentry to verify
    /// it is still valid. Returns true if the dentry is still valid, false if
    /// the VFS should discard it and perform a fresh lookup.
    /// Default: always returns true (local filesystems).
    /// Network FS: checks with the server. Clustered FS: checks DLM lease (§31a.6).
    fn d_revalidate(&self, parent: InodeId, name: &OsStr) -> Result<bool> {
        Ok(true)
    }

    /// Custom name comparison. Called during lookup to compare a dentry name
    /// with a search name. Used by case-insensitive filesystems (e.g., VFAT,
    /// CIFS with case folding, ext4 with casefold feature).
    /// Default: byte-exact comparison.
    fn d_compare(&self, name: &OsStr, search: &OsStr) -> bool {
        name == search
    }

    /// Returns a custom hash for this dentry name, or `None` to use the
    /// VFS default (SipHash-1-3 with per-superblock key from `SuperBlock.hash_key`).
    /// Must be consistent with d_compare: if two names are equal per d_compare,
    /// they must produce the same hash.
    ///
    /// The VFS lookup layer calls `d_hash()` and checks the return value.
    /// If `None`, the VFS uses its own SipHash-1-3 with the per-superblock
    /// random key directly, without requiring filesystem involvement. This
    /// matches Linux's pattern where `d_hash` is only invoked when
    /// `dentry->d_op->d_hash` is non-NULL.
    ///
    /// Filesystems with custom hash requirements (e.g., case-insensitive)
    /// override this to return `Some(hash_value)` using their own algorithm —
    /// they never see the SipHash key. The per-superblock key is managed by
    /// the VFS, not exposed to filesystem implementations.
    fn d_hash(&self, name: &OsStr) -> Option<u64> {
        None
    }

    /// Called when a dentry's reference count drops to zero (dentry enters
    /// the unused LRU list). Filesystem can veto caching by returning false.
    fn d_delete(&self, inode: InodeId, name: &OsStr) -> bool {
        true // default: allow LRU caching
    }

    /// Called when a dentry is finally freed from the cache.
    fn d_release(&self, inode: InodeId, name: &OsStr) {}
}

/// Inode attribute structure — returned by getattr(), compatible with
/// Linux statx(2) for full metadata exposure.
pub struct InodeAttr {
    /// Bitmask of valid fields (STATX_* flags). Filesystems set only
    /// the bits for fields they actually populate.
    pub mask: u32,

    pub mode: u16,        // File type and permissions
    pub nlink: u32,       // Hard link count
    pub uid: u32,         // Owner UID
    pub gid: u32,         // Group GID
    pub ino: u64,         // Inode number
    pub size: u64,        // File size in bytes
    pub blocks: u64,      // 512-byte blocks allocated
    pub blksize: u32,     // Preferred I/O block size

    // Timestamps with nanosecond precision
    pub atime_sec: i64,   // Last access
    pub atime_nsec: u32,
    pub mtime_sec: i64,   // Last modification
    pub mtime_nsec: u32,
    pub ctime_sec: i64,   // Last status change
    pub ctime_nsec: u32,
    pub btime_sec: i64,   // Creation time (birth time)
    pub btime_nsec: u32,

    /// Device ID (for device special files). Encodes major:minor as
    /// (major << 32) | minor. The Linux compat layer (Section 20) splits
    /// this into separate u32 major/minor fields for statx() responses.
    pub rdev: u64,
    /// Device ID of the containing filesystem (same major:minor encoding
    /// as `rdev`; likewise split by the compat layer for statx()).
    pub dev: u64,
    pub mount_id: u64,    // Mount identifier (STATX_MNT_ID, since Linux 5.8)
    pub attributes: u64,  // File attributes (STATX_ATTR_* flags)
    pub attributes_mask: u64, // Supported attributes mask

    // Direct I/O alignment (STATX_DIOALIGN, since Linux 6.1)
    pub dio_mem_align: u32,    // Required alignment for DIO memory buffers
    pub dio_offset_align: u32, // Required alignment for DIO file offsets

    // Subvolume identifier (STATX_SUBVOL, since Linux 6.10; btrfs, bcachefs)
    pub subvol: u64,

    // Atomic write limits (STATX_WRITE_ATOMIC, since Linux 6.11)
    pub atomic_write_unit_min: u32,  // Min atomic write size (power-of-2)
    pub atomic_write_unit_max: u32,  // Max atomic write size (power-of-2)
    pub atomic_write_segments_max: u32, // Max segments in atomic write
    pub atomic_write_unit_max_opt: u32, // Optimal max atomic write size (STATX_WRITE_ATOMIC, since Linux 6.13)

    // Direct I/O read alignment (STATX_DIO_READ_ALIGN, since Linux 6.14)
    pub dio_read_offset_align: u32,  // DIO read offset alignment (0 = use dio_offset_align)
}
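
The device-ID encoding used by `InodeAttr.rdev` / `InodeAttr.dev` and the split performed by the compat layer for statx() can be sketched with two helpers (names are illustrative, not the ISLE API):

```rust
// Sketch of the (major << 32) | minor device-ID encoding described in
// the InodeAttr field comments, and the inverse split the Linux compat
// layer performs for statx() responses. Helper names are illustrative.

pub fn encode_dev(major: u32, minor: u32) -> u64 {
    ((major as u64) << 32) | minor as u64
}

pub fn split_dev(dev: u64) -> (u32, u32) {
    ((dev >> 32) as u32, dev as u32)   // (major, minor)
}

fn main() {
    let dev = encode_dev(259, 1);      // e.g. an NVMe partition's numbers
    assert_eq!(split_dev(dev), (259, 1));
}
```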

Linux comparison: Linux's VFS uses struct super_operations, struct inode_operations, struct file_operations, and struct dentry_operations — C structs of function pointers (Linux's file_operations alone has 30+ methods). ISLE's trait-based design serves the same purpose but with Rust's safety guarantees: a filesystem that forgets to implement fsync is a compile-time error, not a null pointer dereference at runtime. The trait methods above cover the operations needed for POSIX compatibility; rarely-used operations (e.g., fiemap, copy_file_range with cross-filesystem support) are handled by generic VFS fallback code that calls the core read/write/fallocate methods.
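
The consistency rule tying `d_compare` to `d_hash` can be sketched for a case-insensitive filesystem. The ASCII folding and the FNV-1a hash below are illustrative choices, not ISLE's defaults — the invariant demonstrated is that names equal under `d_compare` must hash identically:

```rust
// Sketch: the d_compare / d_hash consistency rule from DentryOps, shown
// for an ASCII case-insensitive filesystem. Folding and FNV-1a are
// illustrative stand-ins (ISLE's VFS default is SipHash-1-3); the point
// is the invariant: d_compare-equal names MUST hash to the same value.

fn fold(b: u8) -> u8 { b.to_ascii_lowercase() }

/// Case-insensitive d_compare over raw name bytes.
pub fn d_compare(name: &[u8], search: &[u8]) -> bool {
    name.len() == search.len()
        && name.iter().zip(search).all(|(a, b)| fold(*a) == fold(*b))
}

/// Matching d_hash: hash the *folded* bytes so equal names collide.
pub fn d_hash(name: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;       // FNV-1a offset basis
    for &b in name {
        h ^= fold(b) as u64;
        h = h.wrapping_mul(0x100000001b3);     // FNV prime
    }
    h
}

fn main() {
    assert!(d_compare(b"README", b"readme"));
    assert_eq!(d_hash(b"README"), d_hash(b"readme")); // invariant holds
    assert!(!d_compare(b"README", b"readmd"));
}
```

Hashing the unfolded bytes instead would silently break lookup: two d_compare-equal names could land in different hash buckets and never be compared at all.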

27a.2 Dentry Cache

The dentry (directory entry) cache is the performance-critical data structure of the VFS. It maps (parent_inode, name) pairs to child inodes, eliminating repeated disk lookups for path resolution.

Data structure: RCU-protected hash table. Read-side lookups are lock-free — no atomic operations on the read path, only a memory barrier on RCU read lock entry/exit. This matches Linux's dentry cache design, which is similarly RCU-protected for the same performance reasons.

Negative dentries: When a lookup() returns ENOENT, the VFS caches a negative dentry for that (parent, name) pair. Subsequent lookups for the same nonexistent path component return ENOENT immediately without calling into the filesystem driver. This is critical for workloads like $PATH searches where the shell looks for an executable in 5-10 directories, finding it only in one. Without negative dentries, every command invocation would perform 4-9 unnecessary disk lookups.
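
The effect of negative dentries can be sketched with a toy cache. A `HashMap` stands in for the RCU hash table, and `None` caches an ENOENT result; all names and types here are illustrative, not the isle-vfs internals:

```rust
// Sketch: a toy dentry cache with negative entries. `None` caches an
// ENOENT result, so repeated lookups of a missing name never reach the
// filesystem driver again. Illustrative only — the real cache is an
// RCU-protected hash table, not a HashMap.
use std::collections::HashMap;

type InodeId = u64;

struct DentryCache {
    entries: HashMap<(InodeId, Vec<u8>), Option<InodeId>>, // None = negative
    driver_calls: usize,                                   // instrumentation
}

impl DentryCache {
    fn new() -> Self { DentryCache { entries: HashMap::new(), driver_calls: 0 } }

    fn lookup(&mut self, parent: InodeId, name: &[u8]) -> Option<InodeId> {
        let key = (parent, name.to_vec());
        if let Some(cached) = self.entries.get(&key) {
            return *cached;                 // hit: positive OR negative
        }
        // Miss: ask the "filesystem driver" (stubbed: only dir 2 has "ls").
        self.driver_calls += 1;
        let result = if parent == 2 && name == b"ls".as_slice() { Some(42) }
                     else { None };
        self.entries.insert(key, result);   // cache the ENOENT result too
        result
    }
}

fn main() {
    let mut cache = DentryCache::new();
    // $PATH-style search: look for "ls" in dirs 1..=3, present only in dir 2.
    for dir in [1u64, 2, 3] { cache.lookup(dir, b"ls"); }
    assert_eq!(cache.driver_calls, 3);
    // Second invocation: one positive and two negative entries are cached.
    for dir in [1u64, 2, 3] { cache.lookup(dir, b"ls"); }
    assert_eq!(cache.driver_calls, 3);      // zero extra driver calls
    assert_eq!(cache.lookup(2, b"ls"), Some(42));
}
```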

Eviction: LRU eviction under memory pressure. The dentry cache integrates with isle-core's memory reclaim (Section 13 — Memory Compression Tier, in 02-kernel-core.md) — when the page allocator signals memory pressure, the dentry cache shrinker evicts least-recently-used entries. Negative dentries are evicted preferentially (they are cheaper to re-create than positive dentries).

27a.3 Path Resolution

Path resolution walks the dentry cache component by component. For example, /usr/lib/libfoo.so resolves as: root dentry -> lookup("usr") -> lookup("lib") -> lookup("libfoo.so").

RCU path walk (fast path): The entire resolution is attempted under an RCU read-side critical section. No dentry reference counts are taken, no locks are acquired. If every component is in the dentry cache and no concurrent renames or unmounts are in progress, the entire path resolves with zero atomic operations.

Ref-walk fallback (slow path): If any component is not cached, or if a concurrent mount/rename is detected (via sequence counters), the RCU walk aborts and restarts in ref-walk mode. Ref-walk takes dentry reference counts and inode locks as needed. This two-phase approach matches Linux's RCU-walk → ref-walk fallback (the LOOKUP_RCU path in the namei code).

Mount point traversal: When a dentry is flagged as a mount point, resolution crosses into the mounted filesystem's root dentry. The mount table is consulted via RCU lookup (no lock) in the fast path.

Symlink resolution: The VFS follows up to 40 nested symlinks before returning ELOOP. This matches the Linux limit and prevents infinite symlink loops.

Capability checks: At each path component, the VFS calls isle-core via the inter-domain ring to verify that the requesting process holds traverse permission on the directory. The check is performed in isle-core's isolation domain, not in isle-vfs -- isle-vfs cannot bypass permission checks because it has no access to the capability tables (hardware-enforced domain separation; see Section 3 and Section 6).
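
The component-by-component walk and the 40-symlink limit can be sketched with a toy resolver. The in-memory "filesystem" and errno constants are illustrative stand-ins (ELOOP is 40 on Linux), not the isle-vfs walker:

```rust
// Sketch: component-by-component path resolution with the 40-symlink
// ELOOP limit described above. A HashMap of (dir, name) -> Node stands
// in for the dentry cache; everything here is illustrative.
use std::collections::HashMap;

enum Node { Dir(u64), File(u64), Symlink(String) }

const ELOOP: i32 = 40;        // Linux errno value for "too many symlinks"
const MAX_SYMLINKS: u32 = 40; // matches the Linux/ISLE nesting limit

fn resolve(ns: &HashMap<(u64, String), Node>, path: &str) -> Result<u64, i32> {
    let mut links = 0u32;
    // Components to walk, in order (stored reversed so pop() yields the next).
    let mut stack: Vec<String> =
        path.trim_start_matches('/').split('/').rev().map(String::from).collect();
    let mut cwd = 1u64;       // root directory inode
    while let Some(comp) = stack.pop() {
        match ns.get(&(cwd, comp)).ok_or(2 /* ENOENT */)? {
            Node::Dir(ino) => cwd = *ino,
            Node::File(ino) if stack.is_empty() => return Ok(*ino),
            Node::File(_) => return Err(20 /* ENOTDIR */),
            Node::Symlink(target) => {
                links += 1;
                if links > MAX_SYMLINKS { return Err(ELOOP); }
                // Splice the link target's components in front of the rest.
                for c in target.split('/').rev() { stack.push(c.to_string()); }
            }
        }
    }
    Ok(cwd)
}

fn main() {
    let mut ns = HashMap::new();
    ns.insert((1, "usr".into()), Node::Dir(2));
    ns.insert((2, "lib".into()), Node::Dir(3));
    ns.insert((3, "libfoo.so".into()), Node::File(99));
    ns.insert((1, "loop".into()), Node::Symlink("loop".into())); // self-loop
    assert_eq!(resolve(&ns, "/usr/lib/libfoo.so"), Ok(99));
    assert_eq!(resolve(&ns, "/loop"), Err(ELOOP));               // loop caught
}
```

The real walker additionally consults sequence counters at each step so a concurrent rename or unmount aborts the fast path; that machinery is omitted here.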

27a.4 Mount Namespace and Capability-Gated Mounting

Each process belongs to a mount namespace containing its own mount tree.

Mount operations are capability-gated:

| Operation | Required Capability | Scope |
| --- | --- | --- |
| mount | CAP_MOUNT | Mount namespace |
| bind mount | CAP_MOUNT + read access to source | Mount namespace + source |
| remount | CAP_MOUNT | Mount namespace |
| umount | CAP_MOUNT | Mount namespace |
| pivot_root | CAP_MOUNT | Mount namespace |

CAP_MOUNT is scoped to the calling process's mount namespace — it does not grant mount authority in other namespaces. A container with its own mount namespace can mount filesystems within that namespace without affecting the host.

Mount propagation: Shared, private, slave, and unbindable propagation types, with the same semantics as Linux (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE). This is essential for container runtimes that rely on mount propagation for volume mounts.

Filesystem type registration: Only isle-core can register new filesystem types with the VFS. Filesystem drivers request registration via the inter-domain ring, and isle-core verifies the driver's identity and KABI certification before granting registration.

27a.5 Distribution-Aware VFS Extensions

When filesystems are shared across cluster nodes (Section 31), the VFS must handle cache validity, locking granularity, and metadata coherence across node boundaries. Linux's VFS was designed for local filesystems with network filesystem support bolted on afterward, resulting in several systemic performance problems. ISLE's VFS addresses these by integrating with the Distributed Lock Manager (Section 31a).

| Linux Problem | Impact | ISLE Fix |
| --- | --- | --- |
| Dentry cache assumes local validity | Remote rename/unlink leaves stale dentries on other nodes | Callback-based invalidation: DLM lock downgrade (Section 31a.8) triggers targeted dentry invalidation for affected directory entries only |
| d_revalidate() on every lookup for network FS | Extra round-trip per path component on NFS/CIFS/GFS2 | Lease-attached dentries: dentry is valid while parent directory DLM lock is held (Section 31a.6); zero revalidation cost during lease period |
| Inode-level locking forces false sharing | Two nodes writing to different byte ranges of the same file serialize on the inode lock | Range locks in VFS: DLM byte-range lock resources (Section 31a.4) allow concurrent operations on different ranges of the same file |
| No concurrent directory operations | mkdir and create in the same directory serialize globally | Per-bucket directory locks: hash-based directory formats (ext4 htree, GFS2 leaf blocks) use separate DLM resources per hash bucket |
| readdir() + stat() = 2N round-trips for N files | ls -l on a 1000-file remote directory requires 2001 operations | getdents_plus() returning attributes with directory entries (analogous to NFS READDIRPLUS but in-kernel, avoiding the userspace/kernel boundary per entry). getdents_plus() is an ISLE VFS-internal operation, not a new syscall: the VFS's readdir implementation populates both the directory entry and its InodeAttr in a single filesystem callback, caching the attributes for a subsequent getattr()/stat() call. Userspace uses the standard getdents64(2) + statx(2) syscalls — the optimization is transparent, eliminating redundant disk or DLM round-trips inside the kernel |
| Full inode cache invalidation on lock drop | Dropping a DLM lock on an inode discards all cached metadata, even fields that haven't changed | Per-field inode validity: mtime/size read from DLM Lock Value Block (Section 31a.3); permissions and ownership from local capability cache; only stale fields refreshed on lock reacquire |

Integration with §31a DLM:

  • Dentry lease binding: When the VFS caches a dentry for a clustered filesystem, it records the DLM lock resource that protects the parent directory. The dentry remains valid as long as that lock is held at CR (Concurrent Read) mode or stronger. When the DLM downgrades or releases the lock (due to contention from another node), the VFS receives a callback and invalidates only the affected dentries — not the entire dentry subtree.

  • Range-aware writeback: When a process holds a DLM byte-range lock and writes to pages within that range, the VFS tracks dirty pages per lock range (not per inode). On lock downgrade, only dirty pages within the lock's range are flushed (Section 31a.8). This eliminates the Linux problem where dropping a lock on a 100 GB file requires flushing all dirty pages, even if only 4 KB was modified.

  • Attribute caching via LVB: The VFS reads frequently-accessed inode attributes (i_size, i_mtime, i_blocks) from the DLM Lock Value Block (Section 31a.3) rather than performing a disk read on every lock acquire. The LVB is updated by the last writer on lock release, so readers always get current values at the cost of a single RDMA operation (~3-4 μs) instead of a disk I/O (~10-15 μs for NVMe).
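
Per-field inode validity (the last row of the table above) can be sketched as a bitmask over cached attributes. Flag names, the `CachedInode`/`Lvb` types, and the refresh logic are illustrative assumptions, not ISLE internals:

```rust
// Sketch: per-field inode validity instead of whole-inode invalidation.
// A bitmask records which cached fields are still valid; on DLM lock
// reacquire, size/mtime are refreshed cheaply from the Lock Value Block
// (no disk I/O) and still-valid fields are left untouched.

const V_SIZE: u8  = 1 << 0;
const V_MTIME: u8 = 1 << 1;
const V_OWNER: u8 = 1 << 2;   // uid — backed by the local capability cache

struct CachedInode {
    size: u64,
    mtime_sec: i64,
    uid: u32,
    valid: u8,                // bitmask of V_* flags
}

/// What the last writer left in the DLM Lock Value Block (§31a.3).
struct Lvb { size: u64, mtime_sec: i64 }

impl CachedInode {
    /// Lock dropped due to contention from another node: only the
    /// LVB-backed fields go stale; ownership stays valid.
    fn on_lock_drop(&mut self) { self.valid &= !(V_SIZE | V_MTIME); }

    /// Lock reacquired: refresh exactly the stale fields from the LVB.
    fn on_lock_acquire(&mut self, lvb: &Lvb) {
        if self.valid & V_SIZE == 0 { self.size = lvb.size; self.valid |= V_SIZE; }
        if self.valid & V_MTIME == 0 { self.mtime_sec = lvb.mtime_sec; self.valid |= V_MTIME; }
    }
}

fn main() {
    let mut ino = CachedInode { size: 4096, mtime_sec: 100, uid: 1000,
                                valid: V_SIZE | V_MTIME | V_OWNER };
    ino.on_lock_drop();                       // another node wrote the file
    assert_eq!(ino.valid, V_OWNER);           // only LVB-backed fields stale
    ino.on_lock_acquire(&Lvb { size: 8192, mtime_sec: 200 });
    assert_eq!((ino.size, ino.mtime_sec, ino.uid), (8192, 200, 1000));
}
```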


28. ZFS Integration

28.1 Native ZFS and Filesystem Licensing

Linux problem: ZFS can't be merged due to CDDL vs GPL license incompatibility. Users rely on out-of-tree OpenZFS which breaks with kernel updates.

ISLE design:

  • The kernel is licensed under ISLE's proposed OKLF v1.3 license framework (see Appendix A of 14-roadmap.md, §55-58 for the full specification — OKLF is a novel license being developed for ISLE, not a pre-existing published license): GPLv2 base with the Approved Linking License Registry (ALLR), which explicitly includes CDDL as an approved license. CDDL-licensed code (like OpenZFS) communicates with the kernel via KABI IPC without license conflict (no in-kernel linking occurs).

  • ZFS is a first-class filesystem driver. Due to CDDL licensing requirements (see Appendix A in 14-roadmap.md), ZFS runs as a Tier 2 driver with KABI process isolation providing the license boundary. The KABI interface ensures stable, well-defined interactions between ZFS and the kernel without requiring in-kernel linking. It benefits from the stable driver ABI, so it won't break with kernel updates.

  • NFSv4 ACLs are first-class (Section 11.3), so ZFS's native ACL model works natively.

  • The filesystem KABI interface is rich enough to support ZFS's advanced features: snapshots, send/receive, datasets, native encryption, dedup, special vdevs.

28.2 ZFS Advanced Features

Section 28 establishes that ZFS is a first-class ISLE citizen via KABI process isolation (Tier 2 driver). This section covers advanced ZFS features that benefit from ISLE's architecture: capability-based dataset management, RDMA-accelerated replication, and cluster integration.

Dataset hierarchy as capability objects — ZFS datasets form a hierarchy (pool → dataset → child dataset → snapshot → clone). In ISLE, each dataset is a capability object. The capability token for a dataset encodes the specific operations permitted:

| Capability | Permits |
| --- | --- |
| CAP_ZFS_MOUNT | Mount the dataset as a filesystem |
| CAP_ZFS_SNAPSHOT | Create/destroy snapshots of the dataset |
| CAP_ZFS_SEND | Generate a send stream (for replication) |
| CAP_ZFS_RECV | Receive a send stream into this dataset |
| CAP_ZFS_CREATE | Create child datasets |
| CAP_ZFS_DESTROY | Destroy the dataset (highest privilege) |

Delegation means transferring a subset of your capabilities to another local entity (a container, a user). A pool administrator holding all capabilities can delegate CAP_ZFS_MOUNT + CAP_ZFS_SNAPSHOT + CAP_ZFS_CREATE for a subtree to a container — the container can mount, snapshot, and create children within its subtree, but cannot destroy the parent dataset or send replication streams. For shared storage across hosts, use clustered filesystems (Section 31) backed by the DLM (Section 31a) over shared block devices (Section 30).
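
Delegation reduces to a subset check on capability masks, which can be sketched as follows (bit assignments and the `delegate` helper are illustrative, not the ISLE capability encoding):

```rust
// Sketch: delegation as capability-subset transfer for ZFS dataset
// tokens. A delegate may receive any subset of the grantor's mask,
// never a superset. Bit values and the helper are illustrative.

const CAP_ZFS_MOUNT: u32    = 1 << 0;
const CAP_ZFS_SNAPSHOT: u32 = 1 << 1;
const CAP_ZFS_SEND: u32     = 1 << 2;
const CAP_ZFS_CREATE: u32   = 1 << 3;
const CAP_ZFS_DESTROY: u32  = 1 << 4;

/// Grant `requested` to a delegate iff it is a subset of `held`.
fn delegate(held: u32, requested: u32) -> Result<u32, &'static str> {
    if requested & !held == 0 { Ok(requested) } else { Err("EPERM: not a subset") }
}

fn main() {
    let admin = CAP_ZFS_MOUNT | CAP_ZFS_SNAPSHOT | CAP_ZFS_SEND
              | CAP_ZFS_CREATE | CAP_ZFS_DESTROY;
    // Container gets mount + snapshot + create for its subtree...
    let container = delegate(admin, CAP_ZFS_MOUNT | CAP_ZFS_SNAPSHOT | CAP_ZFS_CREATE)
        .unwrap();
    // ...and cannot mint destroy or send authority from that token.
    assert!(delegate(container, CAP_ZFS_DESTROY).is_err());
    assert!(delegate(container, CAP_ZFS_SEND).is_err());
    assert!(delegate(container, CAP_ZFS_SNAPSHOT).is_ok());
}
```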

zvol (ZFS volumes) — ZFS volumes are datasets that expose a block device interface instead of a POSIX filesystem. ISLE integrates zvols with isle-block's device-mapper framework — a zvol can serve as the backing store for dm-crypt, dm-mirror, or as an iSCSI LUN (Section 30). This enables ZFS's checksumming, compression, and snapshot capabilities for raw block storage consumers.

zfs send/recv over RDMA — ZFS replication streams (zfs send) are often used for backup, disaster recovery, and dataset migration. In Linux, zfs send | ssh remote zfs recv pushes the stream over TCP (typically SSH-encrypted). ISLE provides a native RDMA transport option:

  • Uses Section 47.3's RDMA infrastructure.

  • Kernel-to-kernel path: when both source and destination run ISLE, the send stream bypasses userspace entirely — data moves directly from the source ZFS module through RDMA to the destination ZFS module.

  • Zero-copy: send stream data is RDMA READ from source memory, written directly into the destination's transaction group.

  • Encryption: if the dataset uses ZFS native encryption, the stream is already encrypted end-to-end. Otherwise, RDMA transport encryption (Section 47.3) protects data in transit.

Import/export compatibility — ISLE's ZFS implementation reads and writes the standard ZFS on-disk format (as defined by OpenZFS). Existing zpools created on Linux, FreeBSD, or illumos can be imported by ISLE without modification. Conversely, zpools created by ISLE can be exported and imported on any OpenZFS-compatible system.


29. Block I/O and Volume Management

Linux problem: LVM/mdadm are mature but fragile when a block device disappears momentarily — the volume layer panics or marks the device as failed. An NVMe driver reload that takes 50ms can cascade into a degraded RAID array and an unnecessary multi-hour resync.

ISLE design:

Device-mapper framework — ISLE implements a device-mapper layer in isle-block with standard targets:

Target Description Linux equivalent
dm-linear Simple linear mapping dm-linear
dm-striped Stripe across N devices dm-stripe
dm-mirror Synchronous mirror (RAID-1) dm-mirror
dm-crypt Transparent encryption (AES-XTS) dm-crypt
dm-verity Read-only integrity verification dm-verity
dm-snapshot Copy-on-write snapshots dm-snapshot
dm-thin Thin provisioning with overcommit dm-thin-pool

LVM2 metadata compatibility — ISLE reads the LVM2 on-disk metadata format (PV headers, VG descriptors, LV segment maps) and constructs logical volumes using device-mapper targets. Existing LVM2 volume groups created under Linux are usable without conversion. LVM2 userspace tools (lvm, pvs, vgs, lvs) work unmodified via the standard device-mapper ioctl interface.

Software RAID — RAID levels 0/1/5/6/10 are implemented as device-mapper targets. MD superblock formats (0.90, 1.0, 1.2) are read for compatibility with existing Linux mdadm arrays. mdadm works unmodified.

Recovery-aware volume layer — This is where ISLE diverges meaningfully from Linux. Temporary disappearance of a block device during a Tier 1 driver reload (~50-150ms) does NOT mark the device as failed:

Volume Layer State Machine:
  DEVICE_ACTIVE       → Normal I/O flow
  DEVICE_RECOVERING   → Driver reload in progress, I/O queued
  DEVICE_FAILED       → Device permanently gone, failover/degrade

Transition rules:
  ACTIVE → RECOVERING:  When driver supervisor signals reload start
  RECOVERING → ACTIVE:  When new driver instance signals ready (typical: <100ms)
  RECOVERING → FAILED:  When recovery timeout expires (default: 5 seconds)
  • During DEVICE_RECOVERING, the volume layer pauses I/O in its ring buffer. No requests are failed; they simply wait.
  • RAID resync is NOT triggered for sub-100ms driver reloads — the array stays clean. The volume layer distinguishes "device temporarily gone for driver reload" from "device removed from bus" by checking the driver supervisor state.
  • If the recovery window exceeds the configurable timeout (default 5s), the device transitions to DEVICE_FAILED and normal degraded-mode behavior applies (RAID rebuilds, error returns for non-redundant volumes).

dm-verity for verified boot is already designed (Section 22.5).
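The device state machine above can be sketched as a small transition function — a minimal sketch in which the method names (`on_reload_start`, `on_driver_ready`, `on_tick`) are illustrative, not the documented isle-block API:

```rust
/// Per-device state in the recovery-aware volume layer (sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DeviceState {
    Active,
    Recovering { elapsed_ms: u64 },
    Failed,
}

/// Default recovery timeout before a RECOVERING device is declared FAILED.
pub const RECOVERY_TIMEOUT_MS: u64 = 5_000;

impl DeviceState {
    /// Driver supervisor signals a Tier 1 reload: queue I/O, don't fail it.
    pub fn on_reload_start(self) -> DeviceState {
        match self {
            DeviceState::Active => DeviceState::Recovering { elapsed_ms: 0 },
            other => other,
        }
    }

    /// New driver instance signals ready: resume I/O, no RAID resync.
    pub fn on_driver_ready(self) -> DeviceState {
        match self {
            DeviceState::Recovering { .. } => DeviceState::Active,
            other => other,
        }
    }

    /// Periodic tick; timeout expiry means the device is permanently gone.
    pub fn on_tick(self, delta_ms: u64) -> DeviceState {
        match self {
            DeviceState::Recovering { elapsed_ms } => {
                let elapsed = elapsed_ms + delta_ms;
                if elapsed >= RECOVERY_TIMEOUT_MS {
                    DeviceState::Failed
                } else {
                    DeviceState::Recovering { elapsed_ms: elapsed }
                }
            }
            other => other,
        }
    }
}
```

The key property is that a sub-timeout reload returns the device to ACTIVE with no FAILED transition in between, so no resync is ever triggered.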

30. Block Storage Networking

Storage networking protocols that expose remote block devices as local storage. These integrate with ISLE's block layer (isle-block), RDMA infrastructure (Section 47.3), and driver recovery model.

iSCSI Initiator

Tier 1 isle-block module implementing the iSCSI initiator role (RFC 7143):

  • Session management: login, logout, connection multiplexing, session recovery
  • SCSI command encapsulation over TCP
  • CHAP authentication (unidirectional and mutual)
  • Header and data digests (CRC32C) for integrity
  • Multi-connection sessions (MC/S) for bandwidth aggregation
  • Error recovery levels 0, 1, and 2

iSCSI Target

Tier 1 module exposing local block devices as iSCSI LUNs:

  • LIO-compatible configuration interface (existing targetcli works via compat layer)
  • ACL-based access control (initiator IQN whitelist + CHAP)
  • Multiple LUNs per target portal group
  • SCSI Persistent Reservations (PR) support (required for clustered filesystems)

iSER (iSCSI Extensions for RDMA)

When RDMA fabric is available (InfiniBand, RoCE, iWARP — Section 47.3), iSCSI sessions transparently upgrade to RDMA transport:

  • Zero-copy data transfer: RDMA READ/WRITE directly between initiator/target memory
  • Kernel-bypass data path: data moves without CPU involvement
  • Same iSCSI session management and authentication, different transport
  • Transparent upgrade: if both ends advertise RDMA capability during login, iSER is negotiated automatically. Applications and management tools see a standard iSCSI session.

NVMe-oF Initiator (Host)

Tier 1 isle-block module implementing the NVMe over Fabrics host side (NVM Express 2.0; NVMe/TCP per the NVMe TCP Transport Specification, TP 8000; NVMe/RDMA per the original NVMe-oF specification, June 2016):

  • Discovery: NVMe-oF discovery protocol (well-known discovery NQN) — initiator queries a discovery controller to enumerate available subsystems and transport addresses. Supports the Discovery Log Page, referrals, persistent discovery connections, and unique discovery controller identification (TP 8013a).
  • NVMe/TCP transport: NVMe commands encapsulated in TCP (NVMe TCP Transport Specification, TP 8000, widely deployed). Lighter than iSCSI — no SCSI translation layer, native NVMe command set. Supports header and data digests (CRC32C), and TLS 1.3 for in-transit encryption (TP 8011).
  • NVMe/RDMA transport: NVMe commands over RDMA (InfiniBand, RoCE, iWARP). Capsule commands sent via RDMA SEND, data transferred via RDMA READ/WRITE — zero-copy, kernel-bypass. Lowest latency option (~3-5 μs network transport; ~10-20 μs end-to-end including NVMe target processing).
  • Multipath: native NVMe multipath (ANA — Asymmetric Namespace Access). Multiple paths to the same namespace are managed by the NVMe driver itself (not dm-multipath). ANA groups indicate path optimality (optimized, non-optimized, inaccessible). ISLE's NVMe multipath integrates with the recovery-aware volume layer (Section 29) — if a path fails due to driver crash, the volume layer waits for recovery rather than immediately failing over.
  • Namespace management: attach/detach namespaces, resize, format — full NVMe-oF namespace management command set.
  • Zoned namespaces (ZNS): NVMe-oF supports zoned namespaces. ISLE exposes these through the block layer's zone interface, compatible with zonefs and f2fs.
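A recovery-aware ANA path-selection policy might look like the following sketch (the types and names are illustrative, not the documented driver API): a path whose driver is mid-reload is skipped for new I/O but is not marked failed, so no permanent failover occurs.

```rust
/// ANA path states per the NVMe ANA model (sketch).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum AnaState { Optimized, NonOptimized, Inaccessible }

/// One path to a namespace; `driver_recovering` is the hypothetical
/// signal from the driver supervisor that a Tier 1 reload is in progress.
#[derive(Clone, Copy, Debug)]
pub struct NvmePath { pub ana: AnaState, pub driver_recovering: bool }

/// Prefer ANA-optimized paths; fall back to any usable path. A path in
/// driver recovery is skipped for new I/O but never declared failed.
pub fn select_path(paths: &[NvmePath]) -> Option<usize> {
    let usable = |p: &NvmePath| !p.driver_recovering && p.ana != AnaState::Inaccessible;
    paths.iter().position(|p| usable(p) && p.ana == AnaState::Optimized)
        .or_else(|| paths.iter().position(|p| usable(p)))
}
```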

NVMe-oF Target (Subsystem)

Tier 1 module exposing local NVMe devices (or any block device) as NVMe-oF subsystems:

  • Subsystem management: create/destroy NVMe subsystems, each with one or more namespaces backed by local block devices (NVMe, zvol, dm device, or any isle-block device).
  • Transport bindings: simultaneous TCP and RDMA listeners on the same subsystem. Clients connect via whichever transport is available.
  • Access control: per-host NQN ACLs. Each allowed host can be restricted to specific namespaces within the subsystem.
  • ANA groups: configure asymmetric namespace access for multipath. Allows active/passive and active/active configurations.
  • Passthrough mode: for local NVMe devices, optionally pass NVMe commands directly to the hardware (no block layer translation). Provides the lowest-latency target implementation — remote host gets near-local NVMe performance.
  • Configuration interface: nvmetcli-compatible JSON configuration (existing Linux NVMe target management tools work via compat layer).

NVMe-oF over Fabrics — Why It Matters

NVMe-oF is replacing iSCSI in new deployments because it eliminates the SCSI translation layer. iSCSI encapsulates SCSI commands (a protocol designed for parallel buses in 1986) over TCP. NVMe-oF speaks NVMe natively — the same command set used by local NVMe SSDs. This means:

  • No SCSI CDB translation overhead
  • Native support for NVMe features (multipath/ANA, zoned namespaces, NVMe reservations)
  • Simpler protocol state machine (NVMe queue pairs vs iSCSI session/connection/task)
  • Lower latency at every layer

ISLE supports both because iSCSI remains dominant in existing infrastructure (and iSER makes it competitive on RDMA fabrics), while NVMe-oF is the clear direction for new deployments.

Protocol comparison:

Protocol Transport CPU overhead Latency Bandwidth
iSCSI TCP High (TCP stack + SCSI) ~100μs 10-25 Gbps
iSER RDMA Minimal (zero-copy) ~15-25μs end-to-end (transport only: ~5-10μs) Line rate (100+ Gbps)
NVMe-oF/TCP TCP Medium (no SCSI layer) ~15-30μs 25-100 Gbps
NVMe-oF/RDMA RDMA Minimal ~10-20μs end-to-end¹ Line rate

¹ NVMe-oF/RDMA latency breakdown: ~3-5 μs network transport (RDMA) + NVMe target processing. The 3-5 μs figure commonly cited represents RDMA transport latency only; end-to-end I/O latency including NVMe device processing is typically ~10-20 μs.

Recovery advantage — Both iSCSI and NVMe-oF initiators run as Tier 1 drivers with state preservation (Section 9). If an initiator driver crashes:

  1. Connection state is checkpointed to the state preservation buffer.
  2. Driver reloads in ~50-150ms.
  3. RDMA transports (iSER, NVMe-oF/RDMA): When a driver crashes, the local RNIC's Queue Pair enters Error state, and the remote side's QP also transitions to Error state from retransmission timeouts. QP state cannot be transparently restored from a checkpoint — the QP must be destroyed and re-created (Reset → Init → RTR → RTS). ISLE performs a fast QP re-creation: checkpointed session parameters (remote QPN, GID, LID, PSN, MTU, RDMA capabilities) allow the new QP to be configured without full connection manager negotiation. The remote side detects the QP failure (via async error event or failed RDMA operation) and cooperates in re-establishing the QP pair. Total recovery: ~50-150ms (fast re-creation, not transparent restore), vs. 10-30 seconds for full re-discovery in Linux.
  4. TCP transports (iSCSI/TCP, NVMe-oF/TCP): Full TCP connection state cannot be reliably restored after a crash (the remote peer's TCP state has advanced: retransmissions, window adjustments, etc.). Instead, ISLE performs a fast reconnect: the checkpointed session parameters (target portal, ISID, TSIH for iSCSI; NQN, controller ID for NVMe-oF) allow session re-establishment without full discovery. The target accepts the reconnect as a session continuation (iSCSI RFC 7143 §6.3.5 session reinstatement; NVMe-oF controller reconnect). I/O commands in flight are retried by the block layer. Total recovery: ~200-500ms (vs. 10-30 seconds for full re-discovery in Linux).

In Linux, an initiator crash requires full session re-establishment: TCP/RDMA reconnection, login/connect, LUN/namespace re-discovery, and filesystem remount. This can take 10-30 seconds and may cause I/O errors visible to applications.
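The per-transport split described above can be captured in a small decision function — a sketch in which every name is illustrative and the timing figures come from the text:

```rust
/// Recovery strategy after an initiator driver crash (sketch).
#[derive(Debug, PartialEq)]
pub enum RecoveryPlan {
    /// RDMA: the QP cannot be restored from a checkpoint; destroy and
    /// re-create it from checkpointed parameters, skipping full CM negotiation.
    FastQpRecreate { min_ms: u64, max_ms: u64 },
    /// TCP: the remote peer's TCP state has advanced; reconnect and present
    /// the session as a continuation (iSCSI session reinstatement /
    /// NVMe-oF controller reconnect).
    FastReconnect { min_ms: u64, max_ms: u64 },
}

pub enum Transport { IscsiTcp, NvmeTcp, Iser, NvmeRdma }

/// Map each transport to its recovery path and expected time bound.
pub fn plan_recovery(t: Transport) -> RecoveryPlan {
    match t {
        Transport::Iser | Transport::NvmeRdma =>
            RecoveryPlan::FastQpRecreate { min_ms: 50, max_ms: 150 },
        Transport::IscsiTcp | Transport::NvmeTcp =>
            RecoveryPlan::FastReconnect { min_ms: 200, max_ms: 500 },
    }
}
```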

Multipath — Two multipath models coexist:

  • iSCSI: dm-multipath integration with the recovery-aware volume layer (Section 29). Multiple iSCSI paths (via different network interfaces or through different target portals) provide redundancy.
  • NVMe-oF: native NVMe ANA multipath (managed by the NVMe driver, not dm-multipath). ANA state changes are handled in-driver with recovery awareness.

Both models coordinate with the volume state machine — if a path fails due to driver crash (not network failure), the volume layer waits for driver recovery rather than immediately failing over.


31. Clustered Filesystems

Shared-disk filesystems where multiple nodes access the same block device simultaneously, coordinated by a distributed lock manager (DLM).

Linux problem — GFS2 and OCFS2 require a complex multi-daemon stack:

  • Corosync: cluster membership and messaging
  • Pacemaker: resource manager and fencing coordinator
  • DLM: distributed lock manager (kernel module + userspace daemon)
  • Fencing agent: STONITH (Shoot The Other Node In The Head) — kills unresponsive nodes to prevent split-brain corruption

These components are developed by different teams, have different configuration languages, and interact in subtle ways. Diagnosing failures requires understanding all four components and their interactions. A single daemon crash can fence the entire node.

ISLE design — The cluster infrastructure from Section 47 provides the foundation. ISLE integrates these components into a coherent architecture:

DLM over RDMA — The DLM (Section 31a) uses Section 47.3's RDMA transport for lock operations. Lock grant/release round-trip is ~3-5μs over RDMA (vs ~30-50μs over TCP in Linux's DLM). This directly impacts filesystem performance — every metadata operation (create, rename, delete, stat) requires at least one DLM lock. At 3-5μs per lock, clustered filesystem metadata operations approach local filesystem performance. See Section 31a for the full DLM design, including RDMA-native lock protocols, lease-based extension, batch operations, and recovery.

Fencing — When a node becomes unresponsive, the cluster must fence it (prevent it from accessing shared storage) before allowing other nodes to recover its locks:

  • IPMI/BMC fencing: power-cycle the node via out-of-band management
  • SCSI-3 Persistent Reservations: revoke the node's reservation on the shared storage device — the storage controller itself blocks I/O from the fenced node
  • Same mechanisms as Linux, but integrated into Section 47.11's cluster membership protocol rather than requiring a separate Pacemaker/STONITH stack

Quorum — Inherits from Section 47.11's split-brain handling. A partition with fewer than quorum nodes self-fences (stops accessing shared storage) to prevent data corruption.

GFS2 compatibility — Reads the GFS2 on-disk format, implemented as an isle-vfs module:

  • Resource groups, dinodes, journaled metadata
  • GFS2 DLM lock types mapped to DLM lock modes (Section 31a.2)
  • Journal recovery for failed nodes
  • Existing GFS2 volumes can be mounted by ISLE without reformatting

OCFS2 compatibility — Same approach: reads the OCFS2 on-disk format, implemented as an isle-vfs module. Lower priority than GFS2.

Recovery advantage — This is where ISLE's architecture fundamentally changes clustered filesystem behavior:

  • Linux: if a node's storage driver crashes, the DLM loses heartbeat from that node. Fencing kicks in — the node is killed (power-cycled or SCSI-3 PR revoked). After reboot (~60s), the node must rejoin the cluster, replay its journal, and re-acquire locks. Other nodes are blocked on any locks held by the crashed node until fencing and recovery complete.
  • ISLE: if a node's storage driver crashes, the driver recovers in ~50-150ms (Tier 1 reload). The DLM heartbeat continues throughout (heartbeat is in isle-core, not the storage driver). The node stays in the cluster. Its locks remain valid. No fencing, no journal replay, no lock recovery. Other nodes never notice.

This transforms clustered filesystem reliability from "minutes of disruption per failure" to "50ms blip per failure." See Section 31a.12 for detailed recovery comparison.


31a. Distributed Lock Manager

The Distributed Lock Manager (DLM) is a first-class kernel subsystem in isle-core that provides cluster-wide lock coordination for shared-disk filesystems (Section 31), distributed applications, and any kernel subsystem requiring cross-node synchronization. It implements the VMS/DLM lock model — the same model used by Linux's DLM, GFS2, OCFS2, and VMS clustering.

The DLM lives in isle-core (not a separate daemon or Tier 1 driver). This is a deliberate architectural choice: lock state survives Tier 1 driver restarts, DLM heartbeat continues during storage driver reloads, and there are zero kernel/userspace boundary crossings for lock operations.

31a.1 Design Overview and Linux Problem Statement

Linux's DLM implementation suffers from seven systemic problems that limit clustered filesystem performance. Each problem stems from architectural decisions made when the Linux DLM was designed for 1 Gbps Ethernet and 4-node clusters in the early 2000s. ISLE's DLM addresses each problem by design:

  1. Global recovery quiesce — the Linux DLM stops ALL lock activity cluster-wide during any node failure recovery. Impact: seconds of cluster-wide stall; all nodes blocked, not just those sharing resources with the dead node. ISLE fix: per-resource recovery — only resources mastered on the dead node are affected; all other lock operations continue uninterrupted (Section 31a.11).
  2. TCP lock transport (~30-50 μs per lock operation). Impact: orders of magnitude slower than hardware allows; metadata-heavy workloads bottleneck on lock latency. ISLE fix: RDMA-native — Atomic CAS for uncontested locks (~3-5 μs including confirmation, zero remote CPU on CAS path), RDMA Send for contested locks (~5-8 μs) (Section 31a.5).
  3. No lock batching — each lock request is a separate network round-trip. Impact: rename() requires 3 locks = 3 round-trips = ~90-150 μs on Linux DLM. ISLE fix: batch API — up to 64 locks grouped by master in a single RDMA Write (~5-10 μs total) (Section 31a.5).
  4. BAST (Blocking AST) callback storms — O(N) invalidation messages for N holders of a contended resource, including uncontended downgrades. Impact: metadata-heavy workloads on large clusters see network saturation from invalidation traffic. ISLE fix: lease-based extension — holders extend cheaply via RDMA Write; minimal traffic for uncontested resources — only periodic one-sided RDMA lease renewals that bypass the remote CPU (zero CPU-consuming traffic, vs. Linux BASTs on every downgrade that require CPU processing); the contended worst case is still O(K) for K active holders, but K ≤ N because expired leases are reclaimed without messaging (Section 31a.6).
  5. Separate daemon architecture — corosync + pacemaker + dlm_controld with kernel/userspace boundary crossings. Impact: every membership change requires multiple kernel↔userspace transitions; diagnosis requires understanding 4 separate components. ISLE fix: integrated in-kernel — membership events from Section 47.11 delivered directly to the DLM; single heartbeat source; no userspace daemons (Section 31a.10).
  6. Lock holder must flush ALL dirty pages on lock downgrade. Impact: dropping an EX lock on a 100 GB file flushes all dirty pages, even if only 4 KB was written. ISLE fix: targeted writeback — the DLM tracks dirty page ranges per lock; only modified pages within the lock's range are flushed (Section 31a.8).
  7. No speculative multi-resource lock acquire. Impact: GFS2 rgrp allocation — each attempt to lock a resource group is a full round-trip; 8 attempts = 8 × 30-50 μs. ISLE fix: lock_any_of(N) primitive — a single message tries N resources, and the first available is granted (Section 31a.7).
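The batching fix can be illustrated with the grouping step it implies: requests bound for the same master collapse into one message. This is a sketch — MAX_BATCH, LockRequest, and group_by_master are hypothetical names; the real batch API is the subject of Section 31a.5.

```rust
use std::collections::BTreeMap;

/// Batch size limit from the table above (up to 64 locks per batch).
pub const MAX_BATCH: usize = 64;

/// A lock request destined for the node that masters its resource.
pub struct LockRequest { pub master_node: u32, pub resource_id: u64 }

/// Group requests by master node; each group would travel as a single
/// RDMA Write instead of one round-trip per lock.
pub fn group_by_master(reqs: &[LockRequest]) -> BTreeMap<u32, Vec<u64>> {
    let mut out = BTreeMap::new();
    for r in reqs.iter().take(MAX_BATCH) {
        out.entry(r.master_node).or_insert_with(Vec::new).push(r.resource_id);
    }
    out
}
```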

31a.2 Lock Modes and Compatibility Matrix

The DLM implements the six standard VMS/DLM lock modes. GFS2 uses all six modes — this is not a simplification, it is the minimum required for correct clustered filesystem operation.

/// DLM lock modes, ordered by exclusivity (lowest to highest).
/// Compatible with Linux DLM, GFS2, and OCFS2 expectations.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LockMode {
    /// Null Lock — placeholder, compatible with everything.
    /// Used to hold a position in the lock queue without blocking others.
    NL = 0,

    /// Concurrent Read — read access, compatible with all except EX.
    /// Used by GFS2 for inode lookup (reading inode from disk).
    CR = 1,

    /// Concurrent Write — write access, compatible with NL, CR, CW.
    /// Used by GFS2 for writing to a file while others read metadata.
    CW = 2,

    /// Protected Read — read-only, blocks writers.
    /// Used by GFS2 for operations requiring consistent metadata snapshot.
    PR = 3,

    /// Protected Write — write, compatible with NL and CR only.
    /// Used by GFS2 for metadata modification (create, rename, unlink).
    PW = 4,

    /// Exclusive — sole access, incompatible with everything except NL.
    /// Used by GFS2 for operations requiring exclusive inode access.
    EX = 5,
}

Compatibility matrix — yes means the two modes can be held concurrently by different nodes:

      NL   CR   CW   PR   PW   EX
NL    yes  yes  yes  yes  yes  yes
CR    yes  yes  yes  yes  yes  no
CW    yes  yes  yes  no   no   no
PR    yes  yes  no   yes  no   no
PW    yes  yes  no   no   no   no
EX    yes  no   no   no   no   no

This matrix follows the standard VMS/DLM compatibility semantics (OpenVMS Programming Concepts Manual, Red Hat DLM Programming Guide Table 2-2; Linux kernel fs/dlm/lock.c __dlm_compat table). Key points: PW is compatible with NL and CR only (PW is the "update lock" — allows one writer with concurrent readers); CW is compatible with NL, CR, and CW (CW allows concurrent writers); PW and CW are mutually incompatible (PW forbids other writers, including CW holders). The matrix is stored as a compile-time constant lookup table for zero-cost compatibility checks on the lock grant path.
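As a sketch of the compile-time lookup table the paragraph above describes (the index constants mirror the LockMode enum discriminants; the COMPAT and compatible names are illustrative):

```rust
/// Mode indices matching the LockMode enum above (NL=0 .. EX=5).
pub const NL: usize = 0;
pub const CR: usize = 1;
pub const CW: usize = 2;
pub const PR: usize = 3;
pub const PW: usize = 4;
pub const EX: usize = 5;

/// Compile-time compatibility table, COMPAT[held][requested],
/// transcribed row-for-row from the matrix above.
pub const COMPAT: [[bool; 6]; 6] = [
    //         NL     CR     CW     PR     PW     EX
    /* NL */ [true,  true,  true,  true,  true,  true ],
    /* CR */ [true,  true,  true,  true,  true,  false],
    /* CW */ [true,  true,  true,  false, false, false],
    /* PR */ [true,  true,  false, true,  false, false],
    /* PW */ [true,  true,  false, false, false, false],
    /* EX */ [true,  false, false, false, false, false],
];

/// Zero-cost check on the grant path: may `requested` be granted while
/// another node holds `held` on the same resource?
pub fn compatible(held: usize, requested: usize) -> bool {
    COMPAT[held][requested]
}
```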

31a.3 Lock Value Blocks (LVBs)

Each lock resource carries a 64-byte Lock Value Block — a small metadata payload piggybacked on lock state. LVBs are the critical optimization that makes clustered filesystem metadata operations efficient.

/// Lock Value Block — 64 bytes of metadata attached to a lock resource.
/// Updated by the last EX/PW holder on downgrade or unlock.
/// Read by PR/CR holders on lock grant.
///
/// MUST be cache-line aligned (`align(64)`). On all target RDMA hardware
/// (ConnectX-5+, EFA, RoCEv2 NICs), a cache-line-aligned 64-byte RDMA Read
/// is performed as a single PCIe transaction, providing de facto atomicity.
/// The alignment is a correctness requirement for the double-read protocol;
/// see the "LVB read consistency" section below.
#[repr(C, align(64))]
pub struct LockValueBlock {
    /// Application-defined data (e.g., inode size, mtime, block count).
    pub data: [u8; 56],

    /// Sequence counter — incremented on every LVB update.
    /// Readers use this to detect stale LVBs after recovery.
    ///
    /// Stored as u64 for alignment and RDMA atomic operation compatibility
    /// (RDMA atomics require 8-byte aligned 8-byte values).
    ///
    /// **Odd/even protocol**: Writers use FAA to increment the counter before
    /// and after writing data. An odd value indicates mid-update (reader should
    /// retry); an even value indicates stable data. The counter is initialized
    /// to 0 (even) on LVB creation.
    ///
    /// **Masking requirement**: Readers MUST mask with `LVB_SEQUENCE_MASK`
    /// (0x0000_FFFF_FFFF_FFFF) before checking parity or comparing values.
    /// The high 16 bits are used for special sentinel values (e.g., INVALID)
    /// and should not be interpreted as part of the sequence counter.
    ///
    /// The 48-bit counter wraps after ~9.2 years at 1M increments/sec
    /// (2^48 / 10^6 ≈ 290 million seconds). See the wrap limitation section
    /// below for handling guidance.
    pub sequence: u64,
}

/// Mask to extract the 48-bit sequence counter from the u64 field.
/// MUST be applied before checking odd/even parity or comparing sequence values.
pub const LVB_SEQUENCE_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

/// Sentinel value indicating an invalid LVB (after recovery from dead holder).
/// Uses high bits outside the 48-bit sequence space to avoid collision.
/// Readers observing this value must treat the LVB as invalid and refresh
/// from disk before use.
pub const LVB_SEQUENCE_INVALID: u64 = 0xFFFF_0000_0000_0000;

Why LVBs matter: Consider the common case of reading a file's size on a clustered filesystem:

Without LVB:
  Node A holds inode EX lock → writes file → updates size on disk → releases EX
  Node B acquires inode PR lock → reads inode FROM DISK → gets current size
  Cost: 1 lock operation (~3-5 μs) + 1 disk read (~10-15 μs NVMe) = ~13-20 μs

With LVB:
  Node A holds inode EX lock → writes file → writes size to LVB → releases EX
  Node B acquires inode PR lock → reads size FROM LVB (in lock grant message)
  Cost: 1 lock operation (~4-6 μs, LVB included) + 0 disk reads = ~4-6 μs

LVBs eliminate one disk read per metadata operation in the common case. GFS2 uses LVBs to cache inode attributes (i_size, i_mtime, i_blocks, i_nlink) and resource group statistics (free blocks, free dinodes). The VFS layer reads these attributes from the LVB via Section 27a.5's per-field inode validity mechanism.

Note: ISLE uses 64-byte LVBs (56 data + 8 sequence counter), vs Linux's 32 bytes, to accommodate extended metadata including the sequence counter and capability token. GFS2 on-disk format compatibility requires translating between 32-byte and 64-byte LVB formats at the filesystem layer: ISLE's GFS2 implementation packs the standard 32-byte GFS2 LVB fields into the first 32 bytes of the 56-byte data portion, using the remaining 24 bytes for ISLE-specific metadata (sequence validation, capability references). When importing a GFS2 volume from Linux, the filesystem driver zero-extends Linux's 32-byte LVBs into the 64-byte format on first lock acquire.
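The 32-to-64-byte translation on import could be as simple as a zero-extension of the data area — a sketch; the function name is illustrative:

```rust
/// Zero-extend a Linux 32-byte GFS2 LVB into the first 32 bytes of
/// ISLE's 56-byte LVB data area on first lock acquire (sketch). The
/// remaining 24 bytes are left zeroed for ISLE-specific metadata.
pub fn import_linux_lvb(linux_lvb: &[u8; 32]) -> [u8; 56] {
    let mut data = [0u8; 56];
    data[..32].copy_from_slice(linux_lvb);
    data
}
```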

LVB read consistency: RDMA does not provide atomic reads for 64-byte payloads (RDMA atomics are limited to 8 bytes). When a node reads an LVB via RDMA Read, a concurrent writer could update the LVB mid-read, producing a torn value. The protocol:

  1. Reader performs an RDMA Read of the full 64-byte LVB.
  2. Reader checks the sequence counter. If the sequence is odd, the writer is mid-update (writers set the sequence to an odd value before writing data, then increment to even after). Retry the read.
  3. Reader performs a second RDMA Read of the full 64-byte LVB. If every byte (data + sequence) matches the first read, the data is consistent. If any byte differs, retry from step 1.

The full-payload comparison (not just the sequence field) catches the case where a writer completes two full updates between the reader's two reads: the 48-bit sequence counter (bits 47:0 of the sequence field) is monotonically increasing (wraps after ~9.2 years at 500K writes/sec — two FAAs per write equals 1M increments/sec — far exceeding practical deployment lifetimes; the correctness argument holds for any deployment shorter than this), so it will differ after any update. The full-payload comparison is a defence-in-depth measure that also detects torn reads where the sequence counter itself was partially updated.

LVB sequence counter wrap limitation: The 48-bit sequence counter (bits 47:0 of the sequence field, masked by LVB_SEQUENCE_MASK) wraps after 2^48 increments. At the maximum sustained write rate (500,000 writes/sec = 1,000,000 FAA operations/sec), wrap occurs in approximately 290 million seconds (~9.2 years). During the wrap transition, a reader could observe sequence=2^48-1 on the first read and sequence=0 on the second read, incorrectly concluding that no write occurred between reads (an ABA problem on the sequence field). This is an acceptable limitation because: (1) the wrap interval far exceeds typical cluster deployment lifetimes; (2) the full-payload comparison (data + sequence) still detects torn reads even during wrap, since the writer's data changes between FAA operations; (3) production deployments monitor LVB write rate and proactively replace LVB structures approaching the wrap threshold. Clusters with write-intensive workloads exceeding ~50,000 writes/sec on critical LVBs may configure periodic LVB rotation (allocate a fresh LVB with a new generation counter) to avoid theoretical wrap scenarios in long-running deployments.

The sequence counter detects torn reads: the reader retries if the sequence changed during the read. This is a consistency mechanism, not an ABA prevention mechanism — ABA is not applicable because the reader does not perform compare-and-swap on the LVB data. The writer protocol uses RDMA Fetch-and-Add (FAA) for both transitions: FAA(sequence, 1) (now odd = writing) → update data → FAA(sequence, 1) (now even = stable). FAA is a standard RDMA atomic operation, ensuring visibility to concurrent one-sided readers.

LVB single-writer guarantee: The double-read protocol's correctness depends on there being at most one concurrent LVB writer for a given resource. This invariant is provided by the DLM lock itself: only a node holding an EX (Exclusive) or PW (Protected Write) lock on a resource may write to that resource's LVB (per the DLM compatibility matrix in §31a.2). Because the DLM guarantees that at most one node holds EX or PW on a resource at any time, the single-writer invariant is guaranteed by the lock mode rules — no additional coordination is needed. During master failover, LVB writes are suspended until the new master is established and the lock state has been recovered, preventing interleaved writes from two nodes each believing they hold the lock.

RDMA ordering correctness argument: The writer updates the LVB via three RDMA operations posted to a single Reliable Connection (RC) Queue Pair: (1) FAA on sequence, (2) RDMA Write to data bytes, (3) FAA on sequence. Per the InfiniBand Architecture Specification (Vol 1, §9.5), operations within a single RC QP are processed at the responder (target NIC) in posting order. Therefore, when FAA #3 completes, the data Write #2 has already completed at the responder's memory. A reader on a DIFFERENT QP (QP_B) may see operations from QP_A interleaved with its own reads — this is the "no inter-QP ordering" property of RDMA. However, the double-read protocol handles this correctly: if QP_A's operations interleave with QP_B's first Read, the torn value will differ from QP_B's second Read (because the writer changed data and/or sequence between reads), causing a retry. The only remaining concern is whether QP_A's three operations can interleave with BOTH of QP_B's reads to produce identical torn values — this is impossible because the FAA operations on the sequence counter are 8-byte RDMA atomics (always observed atomically, no partial reads), and the sequence counter is monotonically increasing. If the reader's two RDMA Reads see the same sequence value (even), the writer either completed all three operations before both reads (data is consistent) or has not started (data is unchanged). If the sequence values differ between the two reads, the reader retries. The double-read protocol is therefore correct under RDMA's relaxed inter-QP ordering model without requiring explicit fencing between QPs.

RDMA Read atomicity and the SIGMOD 2023 analysis: The InfiniBand Architecture Specification does not formally guarantee that an RDMA Read larger than 8 bytes is delivered atomically. Ziegler et al. (SIGMOD 2023) investigated this question and found that in practice, cache-line-aligned 64-byte RDMA Reads are delivered atomically on all tested hardware — their experiments observed no torn reads for objects that fit within a single cache line. This empirical finding supports our cache-line-aligned LVB design. Nevertheless, the IB spec provides no formal guarantee, and future NICs or memory subsystems could behave differently. The double-read protocol provides defence-in-depth across three complementary layers:

  1. Cache-line alignment (de facto atomicity): The #[repr(C, align(64))] requirement ensures the 64-byte LVB is always cache-line aligned. On all shipping RDMA NICs (ConnectX-5+, AWS EFA, RoCEv2 adapters), the responder NIC reads from the last-level cache or memory controller, which operates at cache-line granularity. A cache-line-aligned 64-byte read therefore arrives from the responder as a single coherent unit — a single PCIe TLP — providing de facto atomicity even without formal IB spec guarantees. This is the primary defence.

  2. Probabilistic defence via double-read: Even if a torn read occurs on a specific platform (e.g., under unusual NUMA topology or memory subsystem conditions), the double-read comparison provides a strong probabilistic defence. For both reads to produce identical torn values, the writer's in-progress modifications must create the EXACT same byte pattern in both torn snapshots — including the monotonically increasing sequence counter. Because the sequence counter changes by exactly 2 per complete write (odd during update, even after), reconstructing the same even sequence value twice from independent torn reads of two different write phases would require an astronomically unlikely alignment of byte delivery from two distinct PCIe transactions. In practice this is negligible.

  3. Two-sided fallback (absolute correctness): After 8 retries the reader falls back to a two-sided RDMA Send to the resource master, which reads the LVB under its local lock and returns a consistent snapshot. This path is unconditionally correct regardless of RDMA read atomicity guarantees or NIC implementation details.

Together these three layers ensure correctness: the first eliminates torn reads on all known hardware, the second provides defence-in-depth on any hypothetical future hardware, and the third guarantees forward progress regardless of RDMA semantics.

Livelock prevention: A continuously-updated LVB could cause a reader to retry indefinitely (the writer keeps changing the sequence counter between the reader's two RDMA Reads). To prevent this, the reader enforces a maximum of 8 retries with exponential backoff (1 μs, 2 μs, 4 μs, ..., 128 μs). If all retries are exhausted, the reader falls back to a two-sided RDMA Send to the resource master, requesting a consistent LVB snapshot. The master reads the LVB under its local lock (preventing concurrent writer updates during the read) and returns the consistent value. This fallback adds ~5-8 μs but guarantees forward progress. In practice, a single retry suffices in over 99% of cases — the 8-retry limit is a safety bound for pathological writer contention.

Typical case: 2 RDMA Reads (64 bytes each) = ~3-4 μs total.
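The double-read loop, the 8-retry backoff bound, and the two-sided fallback can be sketched as follows. This is an illustrative model, not the real isle-dlm code: `read_lvb` stands in for a one-sided 64-byte RDMA Read, `master_snapshot` for the two-sided fallback, and the sequence counter is assumed (for illustration only) to occupy the first 8 bytes of the LVB.

```rust
/// Hypothetical sketch of the double-read LVB protocol. An odd sequence
/// value means a writer is mid-update, per the seqlock-style scheme above.
const MAX_RETRIES: u32 = 8;

fn seq_of(lvb: &[u8; 64]) -> u64 {
    // Assumed layout: sequence counter in the first 8 bytes (little-endian).
    let mut b = [0u8; 8];
    b.copy_from_slice(&lvb[..8]);
    u64::from_le_bytes(b)
}

fn read_lvb_consistent(
    mut read_lvb: impl FnMut() -> [u8; 64],
    master_snapshot: impl FnOnce() -> [u8; 64],
) -> [u8; 64] {
    let mut backoff_us = 1u64; // exponential: 1, 2, 4, ..., 128 us
    for _ in 0..MAX_RETRIES {
        let a = read_lvb();
        let b = read_lvb();
        // Accept only if both reads agree AND the sequence is even
        // (even = no write in progress at either read).
        if a == b && seq_of(&a) % 2 == 0 {
            return a;
        }
        std::thread::sleep(std::time::Duration::from_micros(backoff_us));
        backoff_us *= 2;
    }
    // Livelock bound exhausted: ask the master for a snapshot taken under
    // its local lock — unconditionally consistent regardless of RDMA
    // atomicity semantics.
    master_snapshot()
}
```

In the common case the first iteration succeeds (one retry suffices in over 99% of cases, per the livelock-prevention discussion above), so the backoff path is rarely taken.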

After lock master recovery (Section 31a.11), LVBs from dead holders are marked INVALID (sequence counter set to u64::MAX). The next EX or PW holder must refresh the LVB from disk before other nodes can trust it (both EX and PW are write modes that can update the LVB, per the compatibility matrix above).

31a.4 Lock Resource Naming and Master Assignment

Lock resources are identified by hierarchical names that encode the filesystem, resource type, and specific object:

Format: <filesystem>:<uuid>:<type>:<id>[:<subresource>]

Examples:
  gfs2:550e8400-e29b:inode:12345:data      — data lock for inode 12345
  gfs2:550e8400-e29b:inode:12345:meta      — metadata lock for inode 12345
  gfs2:550e8400-e29b:rgrp:42               — resource group 42 allocation lock
  gfs2:550e8400-e29b:journal:3             — journal 3 ownership lock
  gfs2:550e8400-e29b:dir:789:bucket:5      — directory 789 hash bucket 5
  app:mydb:table:users:row:1001            — application-level row lock

Master assignment: Each lock resource is assigned a master node responsible for maintaining the granted/converting/waiting queues. The master is determined by consistent hashing using a virtual-node ring (deliberately different from DSM home-node assignment in Section 47.5.3, which uses modular hashing — hash % cluster_size — for simpler O(1) lookups; the DLM favors minimal redistribution on node changes because lock resources are far more numerous):

// Each physical node has V virtual nodes on the ring (default V=64).
// The ring is a sorted array of (hash, physical_node_id) pairs.
ring = [(hash(node_0, vnode_0), 0), (hash(node_0, vnode_1), 0), ...,
        (hash(node_N, vnode_V), N)]

master(resource_name) = ring.successor(hash(resource_name)).physical_node_id

When a node joins or leaves the cluster, only ~1/N of total resources are remapped (the resources whose ring position falls between the departed node's virtual nodes and their successors). This is the key property of consistent hashing — unlike modular hashing (hash % cluster_size), which remaps nearly all resources on membership change.
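A runnable sketch of this assignment, using std's DefaultHasher as a stand-in for the real ring hash (function names are illustrative, mirroring the pseudocode above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash helper — DefaultHasher substitutes for the real ring hash.
fn h<T: Hash>(x: T) -> u64 {
    let mut s = DefaultHasher::new();
    x.hash(&mut s);
    s.finish()
}

/// Build the ring: V virtual nodes per physical node, sorted by hash.
fn build_ring(nodes: &[u32], vnodes: u32) -> Vec<(u64, u32)> {
    let mut ring: Vec<(u64, u32)> = nodes
        .iter()
        .flat_map(|&n| (0..vnodes).map(move |v| (h((n, v)), n)))
        .collect();
    ring.sort_unstable();
    ring
}

/// master(resource_name) = successor of hash(name) on the ring,
/// wrapping around past the last virtual node. O(log V*N) binary search.
fn master(ring: &[(u64, u32)], name: &str) -> u32 {
    let k = h(name);
    match ring.binary_search_by_key(&k, |&(hash, _)| hash) {
        Ok(i) => ring[i].1,
        Err(i) if i < ring.len() => ring[i].1,
        Err(_) => ring[0].1, // wraparound
    }
}
```

Because any node can rebuild the same ring from the membership list, master lookup is fully local — no directory service is consulted on the lock path.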

Design choice — consistent hashing vs. directory-based master assignment: Linux's DLM uses modular hashing for lock resource mastering. ISLE uses consistent hashing with virtual nodes because: (1) it is fully distributed with no single point of failure — any node can compute any resource's master locally from the ring (O(log V×N) binary search); (2) membership changes remap only ~1/N of resources instead of nearly all. DSM's modular hashing (Section 47.5.3) makes the opposite tradeoff — simpler O(1) lookup at the cost of a full rehash on membership change; the two are separate protocols with different tradeoffs, not a shared scheme. The DLM's tradeoff is that consistent hashing cannot optimize for locality (a node that uses a resource heavily is not preferentially assigned as its master). For workloads where locality matters (e.g., a single node accessing a file exclusively), the DLM's lease mechanism (Section 31a.6) compensates: the holder simply extends its lease without contacting the master, so master location is irrelevant on the fast path.

/// A lock resource managed by the DLM.
pub struct DlmResource {
    /// Resource name (hierarchical, variable-length).
    pub name: ResourceName,

    /// Node ID of the resource master.
    pub master: NodeId,

    /// Lock Value Block for this resource.
    pub lvb: LockValueBlock,

    /// Granted queue — locks currently held.
    /// Intrusive linked list: DlmLock nodes are allocated from a per-lockspace
    /// slab allocator (fixed-size, no heap resizing on the lock grant path).
    pub granted: IntrusiveList<DlmLock>,

    /// Converting queue — locks being converted (upgrade/downgrade).
    /// Processed in FIFO order before the waiting queue.
    pub converting: IntrusiveList<DlmLock>,

    /// Waiting queue — new lock requests waiting for compatibility.
    pub waiting: IntrusiveList<DlmLock>,

    /// Pending CAS confirmations (Section 31a.5). When remote nodes acquire a
    /// lock via RDMA CAS but have not yet sent the confirmation RDMA Send, this
    /// field tracks the expected confirmations. The master defers processing new
    /// incompatible-mode requests against this resource until all confirmations
    /// arrive or time out. A bounded collection is required — not Option<PendingCas>
    /// — because shared-mode CAS operations (e.g., PR acquires) allow multiple
    /// nodes to win concurrently (each successive shared-mode CAS increments
    /// holder_count). For exclusive-mode CAS (EX, PW), at most one entry exists.
    pub pending_cas: ArrayVec<PendingCas, MAX_CLUSTER_NODES>,
}

/// Tracks a pending CAS confirmation for a DlmResource.
pub struct PendingCas {
    /// Node that performed the CAS.
    pub node: NodeId,
    /// Lock mode the node acquired.
    pub mode: LockMode,
    /// Sequence value in the CAS word after the acquire (for timeout reset).
    pub post_cas_sequence: u64,
    /// Timestamp when the CAS was detected (for 500 μs timeout).
    pub detected_at_ns: u64,
}

// Note on allocation strategy: DlmLock nodes are allocated from a per-lockspace
// slab allocator (isle-core §12.2). The slab pre-allocates DlmLock-sized objects
// and grows in page-sized chunks, so individual lock grant/release operations
// never trigger the general-purpose heap allocator. This ensures bounded latency
// on the contested lock path. The intrusive list avoids the pointer indirection
// and dynamic resizing of VecDeque/Vec.

/// A single lock held or requested by a node.
pub struct DlmLock {
    /// Node that owns this lock.
    pub node: NodeId,

    /// Requested/granted lock mode.
    pub mode: LockMode,

    /// Process ID on the owning node (for deadlock detection).
    pub pid: u32,

    /// Flags (NOQUEUE, CONVERT, CANCEL, etc.).
    pub flags: LockFlags,

    /// Timestamp for ordering and deadlock victim selection.
    pub timestamp_ns: u64,
}

31a.5 RDMA-Native Lock Operations

All DLM operations use Section 47.3's RDMA transport. Four protocol flows cover the full lock lifecycle:

1. Uncontested acquire (RDMA Atomic CAS, ~3-5 μs full operation)

When a resource has no current holders or only compatible holders, the requesting node can acquire the lock via RDMA Atomic Compare-and-Swap on the master's lock state word — a 64-bit value encoding the current lock state:

/// 64-bit lock state word, laid out for RDMA Atomic CAS.
/// Stored in master's RDMA-accessible memory for each DlmResource.
///
///   bits [63:61] = current_mode (3 bits: 0=NL, 1=CR, 2=CW, 3=PR, 4=PW, 5=EX)
///   bits [60:48] = holder_count (13 bits: up to 8191 concurrent holders;
///                   sufficient for MAX_CLUSTER_NODES=64, with margin for
///                   future expansion up to ~128x the cluster size limit)
///   bits [47:0]  = sequence (48 bits: monotonic counter for ABA prevention)
///
/// IMPORTANT: current_mode encodes a SINGLE lock mode. This means the CAS fast
/// path only works for HOMOGENEOUS holder sets — all holders must be in the same
/// mode. When holders have different compatible modes (e.g., CR + PR, or CR + PW),
/// the CAS word cannot represent the mixed state. These transitions MUST use the
/// two-sided RDMA Send path (protocol 2 below), where the master's control thread
/// maintains per-holder mode information in the full DlmResource granted queue.
///
/// This is a deliberate design tradeoff: the CAS fast path covers the most common
/// lock patterns in practice:
///   - EX for exclusive write access (single writer)
///   - PR for shared read access (multiple readers)
///   - CR for concurrent read (e.g., GFS2 inode attribute reads via LVB)
/// Mixed-mode combinations (CR+PR, CR+PW, CR+CW) are valid but uncommon in
/// GFS2 workloads — they arise primarily during mode transitions (one node
/// downgrades while another acquires). The two-sided path at ~5-8μs is still
/// 5-10x faster than Linux's TCP-based DLM.
///
/// ABA safety: 48-bit sequence counter. At 500,000 lock ops/sec on a single
/// resource (sustained maximum), wrap time = 2^48 / 500,000 = ~563 million
/// seconds (~17.8 years). This eliminates ABA as a practical concern.
///
/// The full granted/converting/waiting queues are maintained separately in the
/// master's local memory. The CAS word is a fast-path optimization — it
/// encodes enough state for common homogeneous transitions without remote CPU
/// involvement. The master's granted queue is the authoritative lock state;
/// the CAS word is a cache of that state for the fast path.

CAS fast path cases (homogeneous mode only):

Transition                CAS expected    CAS desired     Ops          Notes
Unlocked → EX             NL|0|seq        EX|1|seq+1      1 CAS        First exclusive holder
Unlocked → PR             NL|0|seq        PR|1|seq+1      1 CAS        First protected reader
Unlocked → CR             NL|0|seq        CR|1|seq+1      1 CAS        First concurrent reader
PR → PR (add reader)      PR|K|seq        PR|K+1|seq+1    Read + CAS   Add same-mode holder
CR → CR (add reader)      CR|K|seq        CR|K+1|seq+1    Read + CAS   Add same-mode holder
EX → NL (unlock)          EX|1|seq        NL|0|seq+1      1 CAS        Last holder releases
PR → NL (last reader)     PR|1|seq        NL|0|seq+1      1 CAS        Last holder releases
CR → NL (last reader)     CR|1|seq        NL|0|seq+1      1 CAS        Last holder releases
PR (remove reader)        PR|K|seq        PR|K-1|seq+1    Read + CAS   K>1, decrement count
CR (remove reader)        CR|K|seq        CR|K-1|seq+1    Read + CAS   K>1, decrement count
Unlocked → PW             NL|0|seq        PW|1|seq+1      1 CAS        Single PW holder (PW+PW incompatible)
Unlocked → CW             NL|0|seq        CW|1|seq+1      1 CAS        First concurrent writer
CW → CW (add writer)      CW|K|seq        CW|K+1|seq+1    Read + CAS   CW is self-compatible (per §31a.2 matrix)
CW → NL (last writer)     CW|1|seq        NL|0|seq+1      1 CAS        Last CW holder releases
CW (remove writer)        CW|K|seq        CW|K-1|seq+1    Read + CAS   K>1, decrement count

Transitions that CANNOT use CAS (require two-sided path):

  • Any mode conversion (e.g., PR→EX, EX→PR, CR→PW)
  • Acquiring a mode different from current holders (e.g., CW when current_mode=CR, or PR when current_mode=CW)
  • Adding a second PW holder (PW is not self-compatible)

These transitions require the master's control thread to evaluate the full compatibility matrix and update per-holder mode tracking in the granted queue.

Requester                                   Master (remote memory)
    |                                            |
    |--- RDMA Atomic CAS (expected=UNLOCKED, --->|
    |    desired=EX|1|seq+1)                     |
    |<-- CAS result (old value) ------------------|
    |                                            |
    If old_value matched expected: lock acquired.|
    Zero remote CPU involvement.                 |
    Raw CAS round-trip: ~2-3 μs.               |
    Full acquire (CAS + confirmation): ~3-5 μs. |

For the Read+CAS path (adding a shared reader when holders exist), the requester first performs an RDMA Read to learn the current state, then a CAS to atomically increment the holder count. Total: 2 RDMA operations (~3-5 μs). CAS failure (due to concurrent modification) triggers retry with the returned value as the new expected value.
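The bit layout and the Unlocked → EX fast path can be modelled concretely. This is an illustrative sketch only — a local `&mut u64` stands in for the RNIC-side Atomic CAS on the master's registered memory, and the helper names are hypothetical:

```rust
/// 64-bit lock state word, per the layout comment above:
///   bits [63:61] = mode, bits [60:48] = holder_count, bits [47:0] = sequence.
const MODE_SHIFT: u32 = 61;
const COUNT_SHIFT: u32 = 48;
const COUNT_MASK: u64 = (1 << 13) - 1; // 13-bit holder_count
const SEQ_MASK: u64 = (1 << 48) - 1;   // 48-bit sequence (ABA prevention)

fn encode(mode: u64, holders: u64, seq: u64) -> u64 {
    debug_assert!(mode < 8 && holders <= COUNT_MASK && seq <= SEQ_MASK);
    (mode << MODE_SHIFT) | (holders << COUNT_SHIFT) | seq
}

fn decode(word: u64) -> (u64, u64, u64) {
    (word >> MODE_SHIFT, (word >> COUNT_SHIFT) & COUNT_MASK, word & SEQ_MASK)
}

/// One fast-path transition from the table: Unlocked → EX.
/// The in-place compare-and-swap models what the RNIC does atomically;
/// mode encodings (NL=0, EX=5) follow the layout comment.
fn acquire_ex(word: &mut u64, expected_seq: u64) -> bool {
    const NL: u64 = 0;
    const EX: u64 = 5;
    let expected = encode(NL, 0, expected_seq);
    if *word == expected {
        *word = encode(EX, 1, expected_seq + 1); // desired = EX|1|seq+1
        true // requester holds the lock; must still send confirmation
    } else {
        false // contested or stale seq: fall back to two-sided path
    }
}
```

A failed `acquire_ex` corresponds to the CAS returning an unexpected old value, which is exactly the trigger for the two-sided fallback described below.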

Important: The CAS word is an optimization for the uncontested fast path. It does NOT replace the full lock queues maintained in the master's local memory. When a CAS succeeds, the acquiring node MUST send a two-sided RDMA Send to the master confirming its identity (node ID) and the acquired lock mode. The master updates the full granted queue upon receiving this confirmation. If the master does not receive confirmation within ~500 μs (the confirmation timeout), it assumes the CAS winner crashed before completing the acquire and resets the lock state word via its own CAS (restoring the pre-acquire state). The CAS target word includes a generation counter (the 48-bit sequence field) to prevent ABA issues during this reclamation — the master's restoration CAS uses the post-acquire sequence value as the expected value, so a concurrent legitimate acquire by another node will not be clobbered. This confirmation step is a required correctness measure, not an optimization: without it, if the CAS winner crashes before the master processes its queue entry, recovery would iterate the granted queue and find no record of the holder, leaving the lock state word permanently wedged. When a CAS fails (contested lock, incompatible mode), the requester falls back to the two-sided protocol below. The master's control thread is the sole authority for complex operations (conversions, waiters, deadlock detection).

CAS outcome determination and transport failure recovery. RDMA Atomic CAS is a single round-trip operation: the RNIC performs the compare-and-swap on the remote memory and returns the previous value of the target word in the CAS completion. The requester determines the CAS outcome entirely from this return value — if the returned old value matches the expected value, the CAS succeeded and the lock is held. No separate "confirmation response" from the master's CPU is involved in determining CAS success or failure; the RDMA NIC hardware handles the entire operation atomically. This means the requester always knows whether it acquired the lock, as long as the RDMA completion is delivered.

If the RDMA transport itself fails during a CAS operation (e.g., the Queue Pair enters Error state due to a link failure, cable pull, or remote RNIC reset), the requester receives a Work Completion with an error status (not a successful CAS completion). In this case, the CAS may or may not have been applied to the master's memory — the requester cannot distinguish between "CAS was never sent", "CAS was sent but not executed", and "CAS succeeded but the response was lost in transit." The requester must handle this ambiguity:

  1. Assume the CAS may have succeeded. The requester must not retry the CAS blindly (doing so could double-acquire or corrupt the sequence counter).
  2. Query the master via a recovery path. The requester establishes a fresh RDMA connection (or uses a separate TCP fallback if the RDMA fabric is partitioned) and sends a two-sided lock state query to the master's control thread. The master reads its authoritative lock state — the CAS word in registered memory — and responds with the current lock state plus the sequence counter value.
  3. Master's lock word is ground truth. If the CAS word shows the requester's expected post-CAS value (matching mode, holder count, and sequence), the CAS succeeded and the requester proceeds with the confirmation RDMA Send (on the new connection). If the CAS word shows a different state, the CAS either was not applied or was already reclaimed by the master's confirmation timeout (the ~500 μs timeout described above). In either case, the requester starts a fresh lock acquisition attempt.
  4. Interaction with confirmation timeout. If the CAS succeeded but the requester takes longer than ~500 μs to query the master (due to connection re-establishment), the master may have already reclaimed the lock via its confirmation timeout logic. This is safe: the master's reclamation CAS uses the post-acquire sequence value, so if reclamation occurred, the lock word has been reset and the requester's recovery query will see the reset state. The requester then re-acquires normally.

This recovery path is exercised rarely (only on RDMA transport failures, not on normal CAS contention), so its higher latency (~1-5 ms for connection re-establishment + query) does not affect steady-state performance.

Pending CAS confirmation window: Between a successful CAS and the arrival of the confirmation Send, the CAS word and the master's granted queue are temporarily inconsistent — the CAS word shows a lock held, but the granted queue has no entry. During this window, if another node's CAS fails and it falls back to the two-sided path, the master must handle the discrepancy correctly:

  1. When the master receives a two-sided lock request, it checks BOTH the granted queue AND the CAS word state. If the granted queue is empty but the CAS word shows a held lock, the master knows a CAS confirmation is pending.
  2. The master enqueues the incoming request in the waiting queue and defers processing until either: (a) the CAS confirmation arrives (at which point the granted queue is updated and the waiting queue is processed normally), or (b) the confirmation timeout expires (at which point the master resets the CAS word and processes the waiting queue against the now-empty granted queue).
  3. If the pending CAS mode is compatible with the incoming request's mode (per the §31a.2 compatibility matrix), the master grants the incoming request immediately without waiting for the CAS confirmation. The master also updates the CAS word via its own local CAS to reflect the new holder (incrementing holder_count in the CAS word to account for both the pending CAS winner and the newly granted node). The CAS winner's confirmation, when it arrives, simply adds the CAS winner to the already-updated granted queue. This eliminates the blocking window entirely for same-mode shared requests (e.g., multiple concurrent PR acquires), which are the most common contested case.
  4. For incompatible-mode requests, this deferred processing adds at most 500 μs of latency to the second node's request in the worst case (CAS winner crashed). In the normal case, the confirmation arrives within ~1-2 μs (one RDMA Send), so the deferred processing completes almost immediately. A crashed node's 500 μs delay is negligible compared to the 50-200 ms DLM recovery time.
  5. The master tracks pending CAS confirmations with a per-resource pending_cas: ArrayVec<PendingCas, MAX_CLUSTER_NODES> field (see DlmResource struct in Section 31a.4). A bounded collection is required — not Option<PendingCas> — because shared-mode CAS operations (e.g., PR acquires) allow multiple nodes to win concurrently: each successive shared-mode CAS increments the holder_count field embedded in the CAS word and updates the sequence number, so two or more nodes can complete their CAS atomics before any confirmation arrives. The master must reconcile ALL concurrent CAS winners: it reads the final CAS word once all confirmations have arrived (or the polling timeout expires) and uses the holder_count to verify that the number of confirmations received matches the number of nodes that successfully CAS'd. Any node whose confirmation does not arrive within the timeout is treated as crashed and is excluded from the granted queue. For exclusive-mode CAS (EX, PW), at most one node can win — the CAS word format enforces mutual exclusion — so the collection will contain at most one entry in that case. This field is set when the master observes a CAS word change via periodic polling of the CAS word in its registered memory region, and cleared when all confirmations arrive or times out. Note: The master does NOT receive RDMA completion queue notifications for remote CAS operations (one-sided RDMA is CPU-transparent at the responder). Detection relies on the master's targeted polling of CAS words with pending requests only — the master maintains a per-lockspace pending set of resources with outstanding CAS operations, and polls only those CAS words (poll interval: ~100μs per pending resource). Resources with no pending CAS operations are not polled, so the CPU overhead scales with O(pending) not O(total_resources). On a lockspace with 10,000 resources but only 50 with pending CAS operations, polling generates ~500K polls/second — manageable on a single core.

Security: RDMA CAS access to the lock state word is controlled via RDMA memory registration (Memory Regions / MRs). The master registers each lockspace's CAS word array as a separate RDMA MR and distributes the Remote Key (rkey) only to nodes that hold CAP_DLM_LOCK for that lockspace. Capability verification happens at lockspace join time (a two-sided RDMA Send to the master, which checks CAP_DLM_LOCK via isle-core's capability system before returning the rkey). Nodes that lose CAP_DLM_LOCK have their rkey revoked via RDMA MR re-registration (which invalidates the old rkey). This enforces the capability boundary at the RDMA transport layer — a node without the rkey physically cannot issue CAS operations to the lock state words. The rkey is per-lockspace, so CAP_DLM_LOCK scoping (Section 31a.14) maps directly to RDMA access control.

Rkey lifetime and TOCTOU safety: RDMA rkeys are registered for the lifetime of the node's DLM membership in the lockspace, not per-operation. When a node joins a lockspace, the master registers the RDMA Memory Region and returns the rkey; when the node leaves (graceful or fenced), the MR is deregistered and the rkey is invalidated. This eliminates TOCTOU (time-of-check-to-time-of-use) races: a node that passes the capability check at join time retains a valid rkey for all subsequent lock operations until membership ends. Rkey revocation (for CAP_DLM_LOCK loss) uses RDMA MR re-registration, which atomically invalidates the old rkey -- any in-flight CAS using the old rkey will fail with a remote access error, and the node must re-join the lockspace (re-passing the capability check) to obtain a new rkey.

2. Contested acquire (RDMA Send, ~5-8 μs)

When the CAS fails (resource is already locked in an incompatible mode), the requester falls back to a two-sided RDMA Send to the master's control queue pair:

Requester                                   Master
    |                                            |
    |--- RDMA Send (lock request) ------------->|
    |                                [enqueue in waiting list]
    |                                [check compatibility]
    |                                [if compatible: grant]
    |<-- RDMA Send (lock grant + LVB) ----------|
    |                                            |
    Total latency: 2 RDMA round-trips (~5-8 μs).|

The master's kernel thread processes the request, checks compatibility against the granted queue, and either grants immediately or enqueues for later grant.

3. Lock conversion (upgrade/downgrade)

A node holding a lock can convert it to a different mode without releasing and reacquiring. Conversions use the same protocol as contested acquire (RDMA Send to master). The converting queue is processed before the waiting queue — a conversion request from an existing holder takes priority over new requests.

Common conversions:

  • PR → EX: upgrade from read to write (e.g., before modifying an inode)
  • EX → PR: downgrade from write to read (triggers targeted writeback, Section 31a.8)
  • EX → NL: release write lock but keep queue position (for future reacquire)

4. Batch request (up to 64 locks, ~5-10 μs total)

Multiple lock requests destined for the same master are grouped into a single RDMA Write:

Requester                                   Master
    |                                            |
    |--- RDMA Write (batch: 8 lock requests) -->|
    |                                [process all 8]
    |<-- RDMA Send (batch: 8 grants/queued) ----|
    |                                            |
    Total: ~5-10 μs for 8 locks.                |
    Linux DLM: 8 × 30-50 μs = 240-400 μs.      |

Batch requests are critical for operations that require multiple locks atomically. A rename() requires locks on the source directory, destination directory, and the file being renamed — three locks that can be batched into a single network operation when they share the same master.

When batch locks span multiple masters, the requester sends one batch per master in parallel and waits for all grants. Worst case: N masters = N parallel RDMA operations completing in max(individual latencies) rather than sum(individual latencies).

31a.6 Lease-Based Lock Extension

Problem solved: Linux DLM's BAST (Blocking AST) callback storms.

In Linux, when a node requests a lock in a mode incompatible with current holders, the DLM sends a BAST callback to every holder. For a popular file with 100 readers (PR mode), a writer requesting EX mode triggers 100 BAST messages — O(N) network traffic per contention event. On large clusters (64+ nodes), this becomes a significant source of network overhead.

ISLE's lease-based approach:

  • Every granted lock includes a lease duration (configurable per resource type):
      • Metadata locks: 30 seconds default
      • Data locks: 5 seconds default
      • Application locks: configurable (1-300 seconds)

  • Lease extension: Holders extend their lease cheaply via RDMA Write to the master's lease table — a single one-sided RDMA operation that updates a timestamp. No master CPU involvement. Cost: ~1-2 μs per extension.

  • Revocation strategy:
      • Uncontended resource: No revocation needed. Holders extend leases indefinitely. Minimal network traffic for uncontended locks — only periodic one-sided RDMA lease renewals, which do not interrupt the remote CPU (vs. Linux's periodic BAST heartbeats, which require CPU processing on every node).
      • Contended resource (incompatible request arrives): Master checks lease expiry for all incompatible holders. If all leases have expired, the master grants to the new requester immediately. If any leases are active, the master sends revocation messages to those holders. For the worst case (EX request on a resource with K active CR/PR holders), this is O(K) revocations — the same as Linux's BAST count. The improvement over Linux is in the common case: uncontended resources have zero CPU-consuming traffic — only one-sided RDMA lease renewals that bypass the remote CPU (Linux BASTs are sent even for uncontended downgrade requests and require CPU processing on the receiving node) — and resources where most holders' leases have naturally expired need only revoke the few remaining active holders.
      • Emergency revocation: For locks with the NOQUEUE flag (non-blocking), the master immediately checks compatibility and returns EAGAIN if blocked. No revocation is attempted.

  • Correctness guarantee: Lease expiry is a sufficient condition for revocation — if a holder fails to extend its lease, the master knows the lock can be safely reclaimed. For contended resources, the fallback to immediate revocation (single targeted message) preserves correctness identically to Linux's BAST mechanism.

  • Clock skew safety: Lease timing is master-clock-relative only. The master is the sole arbiter of lease validity. To handle clock skew between holder and master:
      • Grant messages include the master's absolute expiry timestamp.
      • Holders renew at 50% of lease duration (e.g., 15s for a 30s metadata lease), providing a safety margin larger than any reasonable clock skew (seconds).
      • Holders track the master's clock offset from grant/renewal responses and adjust their renewal timing accordingly.
      • If a holder discovers its lease was revoked (via a failed extension response), it must immediately stop using cached data and flush any dirty pages before reacquiring the lock. This is the hard correctness boundary: the holder's opinion of lease validity does not matter — only the master's.
      • NTP or PTP synchronization is recommended but not required for correctness. The protocol is safe with unbounded clock skew — only the renewal safety margin shrinks, increasing the probability of unnecessary revocations (performance, not correctness).

  • Network traffic reduction: From O(N) BASTs per contention event to O(1) for uncontended resources (no active holders — just clear the lease) and O(K) for contended resources with K active holders. Cluster-wide lock traffic is reduced by orders of magnitude on large clusters.
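The renewal-timing rule — renew at 50% of the lease duration, corrected by the measured master clock offset — can be sketched as follows. Struct and field names are hypothetical; the real lease table layout is not specified here:

```rust
/// Holder-side view of a granted lease (illustrative sketch).
/// Absolute times are nanoseconds on the MASTER's clock; the holder
/// tracks an estimated offset (master_time - local_time) learned from
/// grant/renewal responses.
struct Lease {
    granted_at_master_ns: u64, // master timestamp in the grant message
    expiry_master_ns: u64,     // absolute expiry, master clock
    offset_ns: i64,            // estimated master - local clock offset
}

impl Lease {
    /// Local-clock time at which the holder should send its renewal:
    /// 50% of the lease duration, translated into local time via the
    /// tracked offset. Skew only shrinks the safety margin; the master
    /// remains the sole arbiter of validity.
    fn next_renewal_local_ns(&self) -> i64 {
        let duration = self.expiry_master_ns - self.granted_at_master_ns;
        let renew_at_master = self.granted_at_master_ns + duration / 2;
        renew_at_master as i64 - self.offset_ns
    }
}
```

For a 30-second metadata lease granted at master time t, the holder renews at local time corresponding to master time t + 15 s, leaving a 15-second margin before expiry.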

31a.7 Speculative Multi-Resource Lock Acquire

Problem solved: GFS2 resource group contention.

GFS2 must find a resource group (rgrp) with free blocks before allocating file data. In Linux, this is sequential: try rgrp 0, if locked → full round-trip (~30-50 μs); try rgrp 1, if locked → another round-trip. On a busy cluster with 8 rgrps, worst case is 8 × 30-50 μs = 240-400 μs just to find a free rgrp.

ISLE's lock_any_of() primitive:

/// Request an exclusive lock on ANY ONE of the provided resources.
/// The DLM tries all resources and grants the first available one.
/// Returns the index of the granted resource and the lock handle.
pub fn lock_any_of(
    resources: &[ResourceName],
    mode: LockMode,
    flags: LockFlags,
) -> Result<(usize, DlmLockHandle), DlmError>;

The requester sends a single message listing N candidate resources. The master (or masters, if resources span multiple masters) evaluates each candidate and grants the first one that is available in the requested mode.

Requester                              Master(s)
    |                                       |
    |--- "Lock any of [rgrp0..rgrp7]" ---->|
    |                           [try rgrp0: locked]
    |                           [try rgrp1: locked]
    |                           [try rgrp2: FREE → grant]
    |<-- "Granted: rgrp2" ------------------|
    |                                       |
    Total: ~5-10 μs (single round-trip).   |
    Linux: up to 8 × 30-50 μs = 240-400 μs.|

For resources spanning multiple masters, the requester sends parallel requests to each master. The first grant received is accepted; the requester cancels remaining requests.
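The master-side evaluation behind lock_any_of can be modelled with a toy sketch. The `locked` set and function name are illustrative — the real DLM checks the full §31a.2 compatibility matrix per candidate and may span several masters in parallel:

```rust
/// Toy model of lock_any_of's single-master evaluation: walk the candidate
/// list in order and grant the first resource not currently held in an
/// incompatible mode. The returned index mirrors lock_any_of's
/// (usize, DlmLockHandle) result; None models "all candidates busy"
/// (the NOQUEUE-style EAGAIN outcome).
fn grant_any_of(
    candidates: &[&str],
    locked: &std::collections::HashSet<&str>,
) -> Option<usize> {
    candidates.iter().position(|r| !locked.contains(r))
}
```

This mirrors the diagram above: with rgrp0 and rgrp1 locked, a request over [rgrp0..rgrp7] grants rgrp2 in a single round-trip instead of three sequential try-lock round-trips.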

31a.8 Targeted Writeback on Lock Downgrade

Problem solved: Linux's "flush ALL pages" on lock drop.

In Linux, when a node holding an EX lock on a GFS2 inode downgrades to PR or releases to NL, the kernel must flush ALL dirty pages for that inode to disk. This is because Linux's page cache has no concept of which pages were dirtied under which lock — the dirty tracking is per-inode, not per-lock-range.

For a 100 GB file where only 4 KB was modified, Linux flushes ALL dirty pages (which could be the entire file if it was recently written). This turns a lock downgrade into a multi-second I/O operation.

ISLE's per-lock-range dirty tracking:

The DLM integrates with the VFS layer (Section 27a.5) to track dirty pages per lock range:

/// Dirty page tracker associated with a DLM lock.
/// Tracks which pages were modified while this lock was held.
pub struct LockDirtyTracker {
    /// Byte range covered by this lock (for range locks).
    /// For whole-file locks: 0..u64::MAX.
    pub range: core::ops::Range<u64>,

    /// Bitmap of dirty pages within the lock's range.
    /// Indexed by (page_offset - range.start) / PAGE_SIZE.
    ///
    /// Uses a two-level sparse bitmap: an array of 64-bit bitmasks (each covering
    /// 64 pages = 256 KB), indexed by a compact sorted array of chunk IDs. This
    /// provides O(1) set/clear per page and O(dirty_chunks) iteration, without
    /// heap allocation for typical cases (≤16 chunks inline, spilling to slab
    /// allocation only for files with widely scattered dirty pages).
    pub dirty_pages: SparseBitmap,
}

Downgrade behavior:

  • EX/PW → PR (downgrade to read): Flush only pages in dirty_pages bitmap. If 4 KB of a 100 GB file was modified, flush exactly 1 page (~10-15 μs for NVMe), not the entire file. PW (Protected Write) follows the same writeback rules as EX, since both are write modes that can dirty pages (per the compatibility matrix in Section 31a.2).
  • EX/PW → NL (release): Flush dirty pages, then invalidate only pages covered by this lock's range. Other cached pages (from other lock ranges or read-only access) remain valid.
  • Range lock downgrade: When a byte-range lock is downgraded, only dirty pages within that specific byte range are flushed. Pages outside the range are untouched.

Cost reduction: From O(file_size) to O(dirty_pages_in_range). For the common case of small writes to large files, this reduces lock downgrade cost by orders of magnitude.
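A minimal sketch of the two-level sparse bitmap described above (illustrative only — the in-kernel version keeps ≤16 chunks inline and spills to slab allocation; this version uses a Vec for clarity). Each chunk covers 64 pages, i.e. 256 KB at 4 KB pages:

```rust
/// Two-level sparse dirty-page bitmap: sorted (chunk_id, 64-bit mask)
/// pairs, where chunk_id = page / 64.
pub struct SparseBitmap {
    chunks: Vec<(u64, u64)>,
}

impl SparseBitmap {
    pub fn new() -> Self { SparseBitmap { chunks: Vec::new() } }

    /// Mark one page dirty: O(log chunks) lookup plus an O(1) bit set.
    pub fn set(&mut self, page: u64) {
        let (chunk, bit) = (page / 64, page % 64);
        match self.chunks.binary_search_by_key(&chunk, |&(c, _)| c) {
            Ok(i) => self.chunks[i].1 |= 1 << bit,
            Err(i) => self.chunks.insert(i, (chunk, 1 << bit)),
        }
    }

    /// Iterate dirty pages in order — O(dirty_chunks). This is what
    /// targeted writeback walks on downgrade.
    pub fn dirty_pages(&self) -> Vec<u64> {
        let mut out = Vec::new();
        for &(chunk, mask) in &self.chunks {
            for bit in 0..64 {
                if mask & (1 << bit) != 0 { out.push(chunk * 64 + bit); }
            }
        }
        out
    }
}

fn main() {
    let mut bm = SparseBitmap::new();
    bm.set(3);              // page 3, near the start of the file
    bm.set(26_214_400);     // a page ~100 GB into the file
    // Downgrade flushes exactly these 2 pages, not the whole file.
    assert_eq!(bm.dirty_pages(), vec![3, 26_214_400]);
}
```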

31a.9 Deadlock Detection

Distributed deadlock detection uses a gossip-based wait-for graph:

  • Each node maintains a local wait-for graph of lock dependencies. Vertices are globally unique process identifiers (node_id, pid) — bare PIDs are insufficient because PID 1234 on Node A and PID 1234 on Node B are different processes. Edges represent lock dependencies: process (N1, P) holds lock A, process (N2, Q) waits for lock A → edge (N2, Q) → (N1, P).
  • Every 100 ms, nodes exchange wait-for graph edges with their neighbors (gossip protocol). Edge propagation converges in O(log N) gossip rounds for N nodes. Each gossip message includes the (node_id, pid) tuples for both endpoints of each edge, ensuring no PID aliasing across nodes.
  • Each node runs local cycle detection on its accumulated graph. If a cycle is found, the youngest transaction (highest timestamp) is selected as the victim and receives EDEADLK.
  • Victim selection is configurable: youngest (default), lowest priority, or smallest transaction (fewest locks held).

Zero overhead on fast path: Deadlock detection only activates when a lock request has been waiting for longer than a configurable threshold (default: 5 seconds). Short waits (the common case for contended locks) complete before deadlock detection engages. The gossip protocol runs on a low-priority background thread and uses minimal bandwidth (each gossip message is typically <1 KB).

Latency tradeoff justification: The 5-second activation threshold means a true deadlock waits ~5 seconds before detection begins, which is 1,000,000x the typical lock latency (~5 μs). This is acceptable because: (1) deadlocks are rare in practice -- most lock waits resolve within milliseconds; (2) the alternative (immediate distributed cycle detection on every wait) would add gossip overhead to every contended lock operation, degrading the common-case latency that the DLM is optimized for; (3) the 5-second threshold matches Linux DLM's deadlock detection timeout and is well within application tolerance for the rare deadlock case.

Local fast-path detection: For locks mastered on the same node, the master performs immediate local cycle detection when enqueueing a new waiter -- if the waiter and all holders in the cycle are on the same node, the deadlock is detected in O(edges) time without any network round-trips, typically within microseconds. The 5-second gossip-based detection is only needed for cross-node deadlock cycles, where the wait-for graph edges span multiple nodes and must be aggregated via the gossip protocol.
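Cycle detection over the aggregated wait-for graph can be sketched as follows. The sketch assumes the simplified case where each waiter blocks on exactly one holder (so the graph is functional); vertices are (node_id, pid) tuples as described above, and victim selection picks the youngest transaction:

```rust
use std::collections::{HashMap, HashSet};

/// Globally unique process identifier: (node_id, pid).
type Gpid = (u32, u32);

/// Follow waiter → holder chains; a revisited vertex on the current
/// path is a deadlock cycle.
pub fn find_cycle(edges: &HashMap<Gpid, Gpid>) -> Option<Vec<Gpid>> {
    let mut done: HashSet<Gpid> = HashSet::new();
    for &start in edges.keys() {
        if done.contains(&start) { continue; }
        let mut seen: HashMap<Gpid, usize> = HashMap::new();
        let mut path: Vec<Gpid> = Vec::new();
        let mut cur = start;
        loop {
            if done.contains(&cur) { break; }            // merges into a checked chain
            if let Some(&pos) = seen.get(&cur) {
                return Some(path[pos..].to_vec());       // cycle found
            }
            seen.insert(cur, path.len());
            path.push(cur);
            match edges.get(&cur) {
                Some(&next) => cur = next,
                None => break,                           // chain ends: no cycle here
            }
        }
        done.extend(path);
    }
    None
}

/// Default victim policy: youngest transaction (highest start timestamp).
pub fn pick_victim(cycle: &[Gpid], start_ts: &HashMap<Gpid, u64>) -> Gpid {
    *cycle.iter()
        .max_by_key(|&&g| start_ts.get(&g).copied().unwrap_or(0))
        .expect("cycle is non-empty")
}

fn main() {
    let mut edges = HashMap::new();
    // (Node 1, pid 10) waits on (Node 2, pid 20) and vice versa.
    edges.insert((1, 10), (2, 20));
    edges.insert((2, 20), (1, 10));
    let cycle = find_cycle(&edges).expect("deadlock expected");
    assert_eq!(cycle.len(), 2);
    let mut ts = HashMap::new();
    ts.insert((1, 10), 100); // older transaction survives
    ts.insert((2, 20), 200); // younger transaction is the victim
    assert_eq!(pick_victim(&cycle, &ts), (2, 20));
}
```

The victim receives EDEADLK; its locks are released, breaking the cycle.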

31a.10 Integration with Cluster Membership (§47.11)

The DLM receives cluster membership events directly from Section 47.11's cluster membership protocol:

  • NodeJoined: New node added to consistent hash ring. Some lock resources are remapped to the new master (~1/N of resources). The new node receives resource state from the old masters.
  • NodeSuspect: Heartbeat missed. DLM begins preparing for potential recovery but does NOT stop lock operations. Current lock holders continue normally.
  • NodeDead: Confirmed node failure. DLM initiates recovery for resources mastered on or held by the dead node (Section 31a.11).
  • NodeLeaving: Graceful departure. Node transfers mastered resources to their new owners before leaving. Zero disruption.

Single heartbeat source: The DLM does NOT run its own heartbeat. It piggybacks on Section 47.11.2's HeartbeatMessage, which already includes per-node liveness information. This eliminates the Linux problem where DLM and corosync can disagree on node liveness — in ISLE, there is exactly one source of truth for cluster membership.
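The event-to-reaction mapping above can be sketched as a single dispatch function. Type and variant names are illustrative, not the actual isle-core API:

```rust
/// Membership events delivered by §47.11's protocol.
#[derive(Debug, PartialEq)]
pub enum MembershipEvent {
    NodeJoined(u32),
    NodeSuspect(u32),
    NodeDead(u32),
    NodeLeaving(u32),
}

/// DLM-side reaction for each event.
#[derive(Debug, PartialEq)]
pub enum DlmReaction {
    /// Remap ~1/N resources to the new master and transfer state.
    RemapToNewMaster(u32),
    /// Prepare for possible recovery, but keep serving lock operations.
    PrepareOnly(u32),
    /// Per-resource recovery for the dead node (§31a.11).
    RecoverResources(u32),
    /// Graceful handoff of mastered resources before departure.
    GracefulTransfer(u32),
}

pub fn on_membership_event(ev: MembershipEvent) -> DlmReaction {
    match ev {
        MembershipEvent::NodeJoined(n)  => DlmReaction::RemapToNewMaster(n),
        MembershipEvent::NodeSuspect(n) => DlmReaction::PrepareOnly(n),
        MembershipEvent::NodeDead(n)    => DlmReaction::RecoverResources(n),
        MembershipEvent::NodeLeaving(n) => DlmReaction::GracefulTransfer(n),
    }
}

fn main() {
    // NodeSuspect must NOT trigger recovery: holders continue normally.
    assert_eq!(on_membership_event(MembershipEvent::NodeSuspect(7)),
               DlmReaction::PrepareOnly(7));
    assert_eq!(on_membership_event(MembershipEvent::NodeDead(7)),
               DlmReaction::RecoverResources(7));
}
```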

31a.11 Recovery Protocol

Three failure scenarios, each with a targeted recovery flow:

1. Lock holder failure (a node holding locks crashes)

Timeline:
  t=0:    Node B crashes while holding locks on resources R1, R2, R3
  t=300ms: §47.11 heartbeat detects NodeSuspect(B) (3 missed heartbeats at 100ms interval)
  t=1000ms: NodeDead(B) confirmed (10 missed heartbeats)

Recovery (per-resource, NOT global):
  For each resource where B held a lock:
    1. Master removes B's lock from granted queue
    2. If B held EX with dirty LVB: mark LVB as INVALID (sequence = u64::MAX)
    3. Process converting queue, then waiting queue (grant compatible waiters)
    4. If B held journal lock: trigger journal recovery for B's journal

  Resources NOT involving B: completely unaffected. Zero disruption.

  Lease expiry race handling: NodeSuspect is detected at 300ms (3 missed heartbeats),
  but leases may not expire until their full timeout (metadata: 30s, data: 5s). If the
  master attempts to send revocation messages to B during recovery and B is already
  dead (RDMA Send fails), the master does not block indefinitely waiting for B to
  acknowledge revocation. Instead, the master records B as "revocation pending" and
  proceeds with resource recovery immediately — the lease timeout will naturally
  invalidate B's access rights when it expires. For data locks (5s timeout), the
  recovery completes within the lease window; for metadata locks (30s timeout), the
  master may grant new locks on the resource before B's lease expires. This is
  correct because B is confirmed dead at t=1000ms and cannot access the resource.
  The lease timeout provides a safety net in the corner case where NodeDead
  confirmation is delayed beyond the lease duration — if the master cannot confirm
  B's death, B retains access until lease expiry, preserving correctness at the cost
  of temporary unavailability for incompatible lock requests.
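The per-resource recovery steps above (remove the dead holder, invalidate the LVB, promote waiters) can be sketched as follows. Types are illustrative stand-ins for the DlmResource queues of Section 31a.16, and the grant check is simplified to two modes:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode { Pr, Ex }

pub struct Resource {
    pub granted: Vec<(u32, Mode)>,   // (node_id, mode)
    pub waiting: Vec<(u32, Mode)>,
    pub lvb_seq: u64,                // u64::MAX = LVB invalid
}

/// Recovery for one resource where `dead` held a lock. Resources not
/// involving `dead` are never touched — there is no global quiesce.
pub fn recover_holder(res: &mut Resource, dead: u32) {
    // Step 1: remove the dead node's grants.
    let held_ex = res.granted.iter().any(|&(n, m)| n == dead && m == Mode::Ex);
    res.granted.retain(|&(n, _)| n != dead);
    // Step 2: a dirty LVB died with its EX holder.
    if held_ex {
        res.lvb_seq = u64::MAX;
    }
    // Step 3: grant now-compatible waiters (simplified: EX only on an
    // empty granted queue; PR whenever no EX is granted).
    let mut i = 0;
    while i < res.waiting.len() {
        let (n, m) = res.waiting[i];
        let ok = match m {
            Mode::Ex => res.granted.is_empty(),
            Mode::Pr => !res.granted.iter().any(|&(_, gm)| gm == Mode::Ex),
        };
        if ok {
            res.granted.push((n, m));
            res.waiting.remove(i);
        } else {
            i += 1;
        }
    }
}

fn main() {
    let mut res = Resource {
        granted: vec![(2, Mode::Ex)],                 // node 2 (dead) held EX
        waiting: vec![(3, Mode::Pr), (4, Mode::Pr)],
        lvb_seq: 41,
    };
    recover_holder(&mut res, 2);
    assert_eq!(res.lvb_seq, u64::MAX);                // LVB invalidated
    assert_eq!(res.granted, vec![(3, Mode::Pr), (4, Mode::Pr)]);
    assert!(res.waiting.is_empty());
}
```

Step 4 (journal recovery) is filesystem-specific and omitted from the sketch.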

2. Lock master failure (the node responsible for a resource's lock queues crashes)

Timeline:
  t=0:    Node M crashes (was master for resources hashing to M)
  t=1000ms: NodeDead(M) confirmed (10 missed heartbeats per §47.11.2)

Recovery:
  1. Consistent hashing reassigns M's resources to surviving nodes.
     (~1/N resources move, distributed across all survivors.)
  2. Each survivor that held locks on M's resources reports its lock
     state to the new master via RDMA Send.
  3. New master rebuilds granted/converting/waiting queues from
     survivor reports.
  4. Lock operations resume for affected resources.

  Timeline: ~50-200ms for affected resources.
  All other resources: unaffected (their masters are alive).

3. Split-brain (network partition divides cluster)

Inherits Section 47.11.3's quorum protocol:

  • Majority partition: Continues normal DLM operation. Resources mastered on nodes in the minority partition are remapped.
  • Minority partition: Blocks new EX/PW lock acquisitions to prevent conflicting writes. Existing EX/PW locks are downgraded to PR — the holder retains the lock (avoiding re-acquisition on partition heal) but cannot write. Dirty pages held under the downgraded lock are flushed before the downgrade completes (targeted writeback, Section 31a.8). Existing PR and CR locks remain valid for local cached reads. Lease enforcement is suspended in the minority partition: since masters in the majority partition cannot be reached for lease renewal, lease expiry cannot be used to revoke locks. No new writes are permitted. No data corruption is possible because the minority cannot acquire or hold write locks, and read-only access to stale data is explicitly safe for PR/CR modes at the filesystem level — no on-disk corruption or metadata structure damage — though application-visible staleness is possible (e.g., readdir may return deleted entries or miss files created on the majority partition). Applications requiring linearizable reads (e.g., databases with ACID guarantees) may see stale values during the partition; this is inherent to any system that allows minority-partition reads (CAP theorem). DSM integration: the DLM's write-lock downgrade is consistent with the DSM's SUSPECT page mechanism (Section 47.11.3), which write-protects SUSPECT pages while allowing reads. Both subsystems independently block writes in the minority partition, providing defense-in-depth.
  • Partition heals: Minority nodes rejoin. Lock state is reconciled:
    1. Minority nodes report their held lock state to the (majority-elected) masters.
    2. Masters compare against current granted queues (majority wins for conflicts).
    3. Any minority-held locks that conflict with locks granted during the partition are forcibly revoked on the minority nodes (cached data invalidated).
    4. Non-conflicting locks are re-validated and lease timers restarted.

Key difference from Linux: NO global recovery quiesce. Linux's DLM stops ALL lock activity cluster-wide while recovering from ANY node failure. This is because Linux's DLM recovery protocol requires a globally consistent view of all lock state before it can proceed — every node must acknowledge the recovery, and no new lock operations can be processed until all nodes agree.

ISLE's DLM recovers per-resource: only resources mastered on or held by the dead node require recovery. The remaining lock resources (typically 90%+) continue operating without any pause.

31a.12 ISLE Recovery Advantage

The combination of isle-core's architecture and the per-resource DLM recovery protocol creates a fundamentally different failure experience:

Linux path (storage driver crash on Node B):

t=0:      Driver crash
t=0-30s:  Fencing: cluster must confirm B is dead (IPMI/BMC power-cycle
          or SCSI-3 PR revocation). Conservative timeout.
t=30-90s: Reboot: Node B reboots, OS loads, cluster stack starts.
t=90-120s: Rejoin: B rejoins cluster. DLM recovery begins.
          GLOBAL QUIESCE: ALL nodes stop ALL lock operations.
t=120-130s: DLM recovery: all nodes exchange lock state, rebuild queues.
t=130s:    Normal operation resumes.
Total: 80-130 seconds of disruption. ALL nodes affected.

ISLE path (storage driver crash on Node B):

t=0:       Driver crash in Tier 1 storage driver.
t=0:       DLM heartbeat CONTINUES (heartbeat is in isle-core, not the
           storage driver). Cluster does NOT detect a node failure.
t=50-150ms: Driver reloads (Tier 1 recovery, Section 9). State restored
           from checkpoint.
t=150ms:   Driver operational. Lock state was never lost (DLM is in
           isle-core). No fencing needed. No recovery needed.
Total: 50-150ms I/O pause on Node B only. Zero lock disruption.
Zero impact on other nodes.

The difference is architectural: in Linux, the DLM runs in the same failure domain as storage drivers (all are kernel modules that crash together). In ISLE, the DLM is in isle-core — it survives driver crashes. The DLM only needs recovery when isle-core itself fails (which means the entire node is down).

31a.13 Application-Level Distributed Locking

The DLM provides application-visible locking interfaces:

  • flock() on clustered filesystem → transparently maps to DLM lock operations. Applications using flock() for coordination get cluster-wide locking without code changes.
  • fcntl(F_SETLK) byte-range locks → DLM range lock resources. POSIX byte-range locks on clustered filesystems provide true cluster-wide exclusion.
  • Explicit DLM API via /dev/dlm → compatible with Linux's dlm_controld interface. Applications that use libdlm for explicit distributed locking work without modification.
  • flock2() system call (new, ISLE extension) — enhanced distributed lock with:
      - Lease semantics: caller specifies desired lease duration
      - Failure callback: notification when lock is lost due to node failure
      - Partition behavior: configurable (block, release, or fence)
      - Batch support: lock multiple files in a single system call
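A hypothetical argument block for flock2() might look like the following. The field names, layout, and PartitionPolicy variants are illustrative — they track the four features listed above, not a committed ABI (the LOCK_SH/LOCK_EX values match classic flock()):

```rust
pub const LOCK_SH: u32 = 1;
pub const LOCK_EX: u32 = 2;

/// What to do with held locks when this node lands in a minority partition.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PartitionPolicy { Block = 0, Release = 1, Fence = 2 }

/// Illustrative flock2() request block.
#[repr(C)]
pub struct Flock2Request {
    /// LOCK_SH / LOCK_EX / LOCK_UN, as with classic flock().
    pub operation: u32,
    /// Requested lease duration in nanoseconds (0 = lockspace default).
    pub lease_ns: u64,
    /// Behavior on network partition (configurable, per the list above).
    pub on_partition: PartitionPolicy,
    /// Signal delivered if the lock is lost to node failure (0 = none).
    pub notify_signal: u32,
    /// File descriptors to lock atomically (batch support).
    pub fds: *const i32,
    pub fd_count: u32,
}

fn main() {
    let fds = [3i32, 4, 5];
    let req = Flock2Request {
        operation: LOCK_EX,
        lease_ns: 5_000_000_000,            // 5 s data-lock lease
        on_partition: PartitionPolicy::Release,
        notify_signal: 0,
        fds: fds.as_ptr(),
        fd_count: fds.len() as u32,
    };
    assert_eq!(req.fd_count, 3);
    assert_eq!(req.on_partition, PartitionPolicy::Release);
}
```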

31a.14 Capability Model

DLM operations are gated by capabilities (Section 11):

Capability        Permits
CAP_DLM_LOCK      Acquire, convert, and release locks on resources in permitted lockspaces
CAP_DLM_ADMIN     Create and destroy lockspaces, configure parameters, view lock state
CAP_DLM_CREATE    Create new lock resources (for application-level locking via /dev/dlm)

Lockspaces provide namespace isolation — a container with CAP_DLM_LOCK scoped to its own lockspace cannot interfere with locks in other lockspaces. GFS2 creates a lockspace per filesystem; applications create lockspaces via /dev/dlm.

31a.15 Performance Summary

Operation                         ISLE Latency                            Linux DLM                     Improvement
Uncontested acquire               ~3-5 μs (RDMA CAS + confirmation)       ~30-50 μs (TCP)               ~10-15×
Uncontested acquire + LVB read    ~4-6 μs                                 ~100 μs                       ~20×
Contested acquire (same master)   ~5-8 μs (RDMA Send)                     ~100-200 μs (TCP)             ~20-30×
Batch N locks (same master)       ~5-10 μs                                N × 30-50 μs                  ~N×8×
Lock any of N resources           ~5-10 μs                                N × 30-50 μs (sequential)     ~N×8×
Lease extension                   ~1-2 μs (RDMA Write)                    N/A (no leases)
Lock holder recovery              ~50-200 ms (affected resources only)    5-10 s (global quiesce)       ~50×
Lock master recovery              ~200-500 ms (affected resources only)   5-10 s (global quiesce)       ~20×

Arithmetic basis: RDMA CAS latency is measured at 1.5-2.5 μs on InfiniBand HDR (200 Gb/s) and RoCEv2 (100 Gb/s) in published benchmarks. The full uncontested acquire includes the raw CAS (~2-3 μs) plus the mandatory confirmation RDMA Send (~1-2 μs), totaling ~3-5 μs. RDMA Send/Receive for contested locks adds ~1-2 μs for receive-side processing. Linux DLM TCP latency includes TCP stack processing (~15-20 μs round-trip), DLM lock manager processing (~10-15 μs), and completion notification (~5-10 μs), totaling ~30-50 μs in published GFS2 benchmarks. Note: The Linux DLM runs entirely in-kernel since kernel 2.6; dlm_controld handles only membership events, not lock operations.

31a.16 Data Structures

/// DLM lockspace — namespace for a set of related lock resources.
pub struct DlmLockspace {
    /// Lockspace name (e.g., "gfs2:550e8400-e29b" for a GFS2 filesystem).
    pub name: LockspaceName,

    /// Lock resources in this lockspace.
    /// Sharded concurrent hash map: 256 shards, each with its own RwLock.
    /// Shard = hash(resource_name) & 0xFF. This reduces lock contention from
    /// a single global bottleneck to per-shard contention. Individual lock
    /// operations only hold their shard's RwLock, allowing concurrent access
    /// to resources in different shards. DlmResource entries are allocated
    /// from a per-lockspace slab allocator.
    pub resources: ShardedMap<ResourceName, DlmResource, 256>,

    /// Lease configuration for this lockspace.
    pub lease_config: LeaseConfig,

    /// Deadlock detection state.
    pub wait_for_graph: Mutex<WaitForGraph>,

    /// Statistics counters.
    pub stats: DlmStats,
}

/// Per-lockspace lease configuration.
pub struct LeaseConfig {
    /// Default lease duration for metadata locks.
    pub metadata_lease_ns: u64,

    /// Default lease duration for data locks.
    pub data_lease_ns: u64,

    /// Default lease duration for application locks.
    pub app_lease_ns: u64,

    /// Grace period after lease expiry before forced revocation.
    pub grace_period_ns: u64,
}

/// DLM statistics (per-lockspace, exposed via islefs §41).
pub struct DlmStats {
    /// Total lock operations (acquire + convert + release).
    pub lock_ops: AtomicU64,

    /// Operations served by RDMA CAS fast path (uncontested).
    pub fast_path_ops: AtomicU64,

    /// Operations requiring RDMA Send (contested).
    pub slow_path_ops: AtomicU64,

    /// Batch operations.
    pub batch_ops: AtomicU64,

    /// Lock-any-of operations.
    pub lock_any_ops: AtomicU64,

    /// Deadlocks detected.
    pub deadlocks_detected: AtomicU64,

    /// Recovery events (holder + master).
    pub recovery_events: AtomicU64,
}

31a.17 Licensing

The VMS/DLM lock model is published academic work (VAX/VMS Internals and Data Structures, Digital Press, 1984). The six-mode compatibility matrix, Lock Value Block concept, and granted/converting/waiting queue model are well-documented in public literature and implemented by multiple independent projects (Linux DLM, Oracle DLM, HP OpenVMS DLM). No patent or proprietary IP concerns.

RDMA Atomic CAS and Send/Receive operations are standard InfiniBand/RoCE verbs defined by the IBTA (InfiniBand Trade Association) specification, which is publicly available.


32. Persistent Memory

32.1 The Hardware

CXL-attached persistent memory is coming (Samsung CMM-H with NAND-backed persistence via CXL GPF, SK Hynix). Also: battery-backed DRAM (NVDIMM-N) for enterprise storage. The model: byte-addressable memory that survives power loss.

32.2 Design: DAX (Direct Access) Integration

// isle-core/src/mem/persistent.rs

/// Persistent memory region descriptor.
pub struct PersistentMemoryRegion {
    /// Physical address range.
    pub base: PhysAddr,
    pub size: u64,

    /// NUMA node this persistent memory is attached to.
    pub numa_node: u16,

    /// Technology type (affects performance characteristics).
    pub tech: PmemTechnology,

    /// Is this region backed by a filesystem (DAX mode)?
    pub dax_device: Option<DeviceNodeId>,
}

#[repr(u32)]
pub enum PmemTechnology {
    /// Intel Optane / 3D XPoint (legacy, for existing deployments).
    Optane          = 0,
    /// CXL-attached persistent memory.
    CxlPersistent   = 1,
    /// Battery-backed DRAM (NVDIMM-N).
    BatteryBacked   = 2,
}

32.3 Memory-Mapped Persistent Storage

When a filesystem on persistent memory is mounted with DAX:

Standard file I/O (non-DAX):
  read() → VFS → page cache → memcpy to userspace
  write() → VFS → page cache → writeback → storage device

DAX file I/O:
  read() → VFS → mmap directly to persistent memory → load instruction
  write() → VFS → store instruction → persistent memory
  No page cache. No copies. No writeback.
  CPU load/store talks directly to persistent media.

The memory manager must handle persistent pages differently:

  • Persistent pages are NOT evictable (they ARE the storage).
  • fsync() → CPU cache flush (CLWB/CLFLUSH), not block I/O.
  • MAP_SYNC flag ensures metadata (file size, timestamps) is also persistent.
  • Crash consistency: partial writes are visible after reboot (see §32.4).

32.4 Crash Consistency Protocol

Persistent memory stores survive power loss, but CPU caches do not. Without explicit cache flushing, writes to persistent memory may be reordered or lost in the CPU write-back cache. The kernel must enforce a strict persistence protocol:

Persistence primitives (x86):
  CLWB addr     — Write-back cache line, leave line CLEAN but VALID in cache.
                  (Preferred: no performance penalty on subsequent reads.)
  CLFLUSHOPT addr — Flush cache line, INVALIDATE from cache.
                  (Legacy: forces re-fetch on next read.)
  SFENCE        — Store fence. Guarantees all preceding CLWB/CLFLUSHOPT
                  have reached the persistence domain (ADR/eADR boundary).

Correct write sequence for persistent data:
  1. Store data to persistent memory region (mov/memcpy)
  2. CLWB for each modified cache line (64 bytes each)
  3. SFENCE  ← data is now durable
  4. Store metadata update (e.g., committed flag, log tail pointer)
  5. CLWB for metadata cache line(s)
  6. SFENCE  ← metadata is now durable (atomically marks data as committed)

ARM equivalent:
  DC CVAP addr  — Clean data cache to Point of Persistence (ARMv8.2+)
  DSB           — Data Synchronization Barrier

fsync() on a DAX-mounted filesystem translates to CLWB + SFENCE (not block I/O). msync(MS_SYNC) on DAX mappings follows the same path. The kernel provides pmem_flush() and pmem_drain() helpers that abstract the architecture-specific instructions.
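The two-fence commit sequence above can be sketched with the hardware primitives (CLWB/SFENCE, DC CVAP/DSB) abstracted behind a trait, which makes the ordering logic testable off-hardware. Trait and function names are illustrative, not the actual pmem_flush()/pmem_drain() signatures:

```rust
/// Abstraction over the architecture-specific persistence primitives.
pub trait PersistDomain {
    fn flush_line(&mut self, a: usize); // CLWB (x86) / DC CVAP (ARM)
    fn drain(&mut self);                // SFENCE (x86) / DSB (ARM)
}

const CACHE_LINE: usize = 64;

/// Durably write `len` bytes at `addr`, then durably set a commit flag.
/// The flag's flush is fenced AFTER the data's flush, so a crash can
/// never expose a set flag with unflushed data.
pub fn persist_committed<P: PersistDomain>(p: &mut P, addr: usize, len: usize, flag_addr: usize) {
    // Steps 1-2: (stores happened earlier) flush every modified line.
    let mut line = addr & !(CACHE_LINE - 1);
    while line < addr + len {
        p.flush_line(line);
        line += CACHE_LINE;
    }
    p.drain();                                   // step 3: data durable
    // Steps 4-6: flush + fence the commit flag's cache line.
    p.flush_line(flag_addr & !(CACHE_LINE - 1));
    p.drain();                                   // step 6: commit durable
}

/// Mock domain that records the primitive sequence for inspection.
struct Log(Vec<String>);
impl PersistDomain for Log {
    fn flush_line(&mut self, a: usize) { self.0.push(format!("clwb {a:#x}")); }
    fn drain(&mut self) { self.0.push("sfence".into()); }
}

fn main() {
    let mut log = Log(Vec::new());
    // 130 bytes starting at 0x1000 span cache lines 0x1000/0x1040/0x1080.
    persist_committed(&mut log, 0x1000, 130, 0x2000);
    assert_eq!(log.0, ["clwb 0x1000", "clwb 0x1040", "clwb 0x1080",
                       "sfence", "clwb 0x2000", "sfence"]);
}
```

A real implementation would use CLWB via inline assembly or intrinsics behind the same interface; the invariant being tested — a fence strictly between the data flushes and the flag flush — is what makes the commit atomic across power loss.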

32.5 PMEM Error Handling

Persistent memory is physical media and can develop errors (bit rot, wear-out, manufacturing defects). The error model mirrors Linux badblocks:

Error sources:
  1. UCE (Uncorrectable Error) — MCE (Machine Check Exception) on x86,
     SEA (Synchronous External Abort) on ARM.
     CPU receives #MC / abort when reading a poisoned cache line.

  2. ARS (Address Range Scrub) — ACPI background scan discovers latent
     errors before they're read. Results reported via ACPI NFIT.

  3. CXL Media Error — CXL 3.0 devices report media errors via CXL
     event log (Get Event Records command).

Kernel response:
  MCE/SEA on PMEM page:
    1. Mark physical page as HWPoison (same as DRAM MCE path).
    2. Add to per-region badblocks list.
    3. If a process has the page mapped:
       a. DAX mapping → deliver SIGBUS (BUS_MCEERR_AR) with fault address.
       b. Process can handle SIGBUS and skip/retry the corrupted region.
    4. Filesystem (ext4/xfs DAX) is notified via dax_notify_failure().
       Filesystem marks affected file range as damaged.

  ARS/CXL background error:
    1. ACPI notification or CXL event interrupt.
    2. Add to badblocks list.
    3. If mapped: deliver SIGBUS (BUS_MCEERR_AO — action optional).
    4. Userspace can query badblocks via /sys/block/pmemN/badblocks.
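The per-region badblocks list referenced in both response paths can be sketched as a sorted set of merged ranges. This mirrors the Linux badblocks model described above; names and the 512-byte-sector granularity are illustrative:

```rust
/// Minimal per-region badblocks tracker.
pub struct BadBlocks {
    /// Sorted, non-overlapping bad ranges in sectors: (start, len).
    ranges: Vec<(u64, u64)>,
}

impl BadBlocks {
    pub fn new() -> Self { BadBlocks { ranges: Vec::new() } }

    /// Record a newly discovered bad range (from MCE/SEA, ARS, or a
    /// CXL media error event), merging adjacent/overlapping ranges.
    pub fn add(&mut self, start: u64, len: u64) {
        self.ranges.push((start, len));
        self.ranges.sort();
        let mut merged: Vec<(u64, u64)> = Vec::new();
        for (s, l) in self.ranges.drain(..) {
            match merged.last_mut() {
                Some((ms, ml)) if s <= *ms + *ml => {
                    *ml = (*ml).max(s + l - *ms);   // extend previous range
                }
                _ => merged.push((s, l)),
            }
        }
        self.ranges = merged;
    }

    /// Does an access touching [start, start+len) hit a bad sector?
    /// (Used to fail reads early instead of taking an MCE.)
    pub fn intersects(&self, start: u64, len: u64) -> bool {
        self.ranges.iter().any(|&(s, l)| start < s + l && s < start + len)
    }
}

fn main() {
    let mut bb = BadBlocks::new();
    bb.add(100, 2);
    bb.add(102, 2);       // adjacent: merges into (100, 4)
    assert!(bb.intersects(103, 1));
    assert!(!bb.intersects(0, 10));
}
```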

32.6 Integration with Memory Tiers

Persistent memory becomes another level in the memory hierarchy. Note: the "Memory Level" numbering below refers to the memory distance hierarchy, NOT the driver isolation tiers (Tier 0/1/2) used elsewhere in this architecture.

Existing memory levels (Section 43, Section 47.6):
  Level 0: Per-CPU caches
  Level 1: Local DRAM
  Level 2: Remote DRAM (cross-socket)
  Level 3: CXL pooled memory
  ...

Extended:
  Level N: Persistent memory (CXL-attached or NVDIMM)
    Properties:
      - Byte-addressable (like DRAM)
      - Survives power loss (like storage)
      - Higher latency than DRAM (~200-500ns vs ~80ns)
      - Lower bandwidth than DRAM
      - Cannot be evicted (it IS the backing store)

32.7 Linux Compatibility

Linux persistent memory interfaces are preserved:

/dev/pmem0, /dev/pmem1:        Block device interface (libnvdimm)
/dev/dax0.0, /dev/dax1.0:      Character DAX device (devdax)
mount -o dax /dev/pmem0 /mnt:  DAX-mounted filesystem
mmap() with MAP_SYNC:          Guaranteed persistence of metadata

Optane Discontinuation Note:

Intel discontinued Optane persistent memory products in 2022. The persistent memory design in this section is hardware-agnostic — it applies to any byte-addressable persistent medium. CXL 3.0 Type 3 devices with persistence (battery-backed or inherently persistent media) are the expected successor. The PmemTechnology enum includes CxlPersistent for this reason. The DAX path, cache flush protocol, and error handling are technology-independent.

PMEM Namespace Discovery:

Persistent memory regions are discovered via:

  • ACPI NFIT (NVDIMM Firmware Interface Table): For NVDIMM-N and legacy Optane. The NFIT describes each PMEM region's physical address range, interleave set, and health status.
  • CXL DVSEC (Designated Vendor-Specific Extended Capability): For CXL-attached persistent memory. CXL devices advertise memory regions via PCIe DVSEC structures. The kernel's CXL driver enumerates regions and creates /dev/daxN.M device nodes.
  • Namespace management: Regions are partitioned into namespaces via ndctl (userspace tool) using the Linux-compatible namespace management ioctl interface. ISLE implements the same ioctls via isle-compat.

32.8 Performance Impact

Zero overhead for systems without persistent memory. When persistent memory is present: DAX I/O is faster than standard I/O (eliminates page cache copies and writeback). Performance improves.

35.5 Filesystem Repair and Consistency Checking

Filesystem repair (fsck, xfs_repair, btrfs check) is handled by existing Linux userspace utilities running against ISLE's block device interface. ISLE does not implement in-kernel repair paths — the standard Linux repair tools are unmodified userspace binaries that interact with block devices via standard syscalls (open, read, write, ioctl). Since ISLE implements the complete block device interface (Section 29) and the relevant filesystem syscalls (Section 20), these tools work unchanged:

  • e2fsck / fsck.ext4 for ext4 repair
  • xfs_repair for XFS repair
  • btrfs check / btrfs scrub for btrfs repair (btrfs scrub runs online)
  • ZFS self-heals via block-level checksums (Section 28); zpool scrub is the equivalent of fsck for ZFS

No kernel-side changes are needed to support these tools. The only ISLE-specific consideration is that filesystem drivers should expose consistent BLKFLSBUF and BLKRRPART ioctl behavior matching Linux, as some repair tools use these to synchronize cache state.

35.6 SCSI-3 Persistent Reservations

SCSI-3 Persistent Reservations (PR) are required for shared-storage cluster fencing (Section 31). ISLE's block I/O layer implements the following PR commands as ioctls on block devices:

  • PR_REGISTER / PR_REGISTER_AND_IGNORE: register a reservation key with the storage target. Each node registers a unique key (derived from node ID).
  • PR_RESERVE: acquire a reservation (Write Exclusive, Exclusive Access, or their "Registrants Only" variants).
  • PR_RELEASE: release a held reservation.
  • PR_CLEAR: clear all registrations and reservations.
  • PR_PREEMPT / PR_PREEMPT_AND_ABORT: preempt another node's reservation (used for fencing — a surviving node preempts the fenced node's key).

These map to SCSI PR IN / PR OUT commands (SPC-4) for SCSI/SAS devices and to NVMe Reservation Register/Acquire/Release/Report commands for NVMe devices. The block layer translates between the common ioctl interface and the device-specific command set. The fencing integration with Section 47.11's membership protocol uses PR_PREEMPT_AND_ABORT to revoke a dead node's storage access before recovering its DLM locks.
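The reservation-type half of that translation can be sketched as a direct mapping. The RTYPE values are taken from the NVMe base specification's Reservation Acquire command; the enum itself is an illustrative stand-in for the block layer's common PR type:

```rust
/// Common block-layer reservation types (illustrative subset,
/// corresponding to the PR_RESERVE variants listed above).
#[derive(Clone, Copy)]
pub enum PrType {
    WriteExclusive,
    ExclusiveAccess,
    WriteExclusiveRegOnly,
    ExclusiveAccessRegOnly,
}

/// Translate to the NVMe reservation type field (RTYPE).
pub fn pr_to_nvme_rtype(t: PrType) -> u8 {
    match t {
        PrType::WriteExclusive         => 1, // Write Exclusive
        PrType::ExclusiveAccess        => 2, // Exclusive Access
        PrType::WriteExclusiveRegOnly  => 3, // Write Exclusive - Registrants Only
        PrType::ExclusiveAccessRegOnly => 4, // Exclusive Access - Registrants Only
    }
}

fn main() {
    assert_eq!(pr_to_nvme_rtype(PrType::WriteExclusive), 1);
    assert_eq!(pr_to_nvme_rtype(PrType::ExclusiveAccessRegOnly), 4);
}
```

The SCSI side performs the analogous translation into SPC-4 PR OUT service actions and scope/type codes.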


33. Computational Storage

33.1 Problem

NVMe Computational Storage Devices (CSDs) can run compute on the storage device: filter, aggregate, search, compress — without moving data to the host CPU.

33.2 Design: CSD as AccelBase Device

A CSD naturally fits the accelerator framework (Section 42). It's a device with local memory (flash) and compute capability (embedded processor):

// Extends AccelDeviceClass (Section 42)

#[repr(u32)]
pub enum AccelDeviceClass {
    Gpu             = 0,
    GpuCompute      = 1,
    Npu             = 2,
    Tpu             = 3,
    Fpga            = 4,
    Dsp             = 5,
    MediaProcessor  = 6,
    /// Computational Storage Device.
    /// "Local memory" = flash storage on the device.
    /// "Compute" = embedded processor running submitted programs.
    ComputeStorage  = 7,
    Other           = 255,
}

Note: The AccelDeviceClass enum is canonically defined in Section 42.1 (11-accelerators.md). The ComputeStorage variant (value 7) must be added to the canonical definition to support computational storage devices.

33.3 CSD Command Submission

Standard NVMe read (move data to compute):
  Host CPU ← 1 TB data ← NVMe SSD
  Host CPU processes 1 TB → produces 1 MB result
  Total data moved: 1 TB

CSD compute (move compute to data):
  Host CPU → submit "grep pattern" → CSD
  CSD processes 1 TB internally → produces 1 MB result
  Host CPU ← 1 MB ← CSD
  Total data moved: 1 MB (1000x reduction)

The CSD accepts commands via the AccelBase vtable:

  • create_context: allocate CSD execution context
  • submit_commands: submit a compute program (filter, aggregate, map, etc.)
  • poll_completion: check if computation is done
  • Results are returned via DMA to host memory

33.4 CSD Security Model

CSDs run arbitrary compute programs on the device's embedded processor. The kernel must enforce access boundaries:

Capability-gated namespace access:
  1. Each NVMe namespace has an owner (cgroup or capability).
  2. CSD compute programs can ONLY access namespaces granted to
     the submitting process's capability set.
  3. Cross-namespace access (e.g., join across two datasets on
     different namespaces) requires capabilities for BOTH namespaces.
  4. The CSD driver enforces this BEFORE submitting to hardware
     via the NVMe Computational Storage command set.

Program validation:
  - CSD programs are opaque to the kernel (device-specific bytecode).
  - The kernel does NOT inspect or validate program contents.
  - Trust boundary: the NVMe device enforces isolation between
    namespaces at the hardware level (NVMe namespace isolation).
  - If the CSD hardware lacks namespace isolation, the kernel
    treats the device as single-tenant (only one cgroup at a time).

DMA buffer isolation:
  - Result DMA buffers are allocated from the submitting process's
    address space (via IOMMU-mapped regions, same as GPU DMA).
  - CSD cannot DMA to arbitrary host memory — IOMMU enforces this.
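The pre-submission namespace check can be sketched as follows. Names are illustrative, not the actual isle-core capability API; the point is that the driver rejects the submission before any command reaches hardware:

```rust
use std::collections::HashSet;

/// The NVMe namespaces a process's capability set permits it to target.
pub struct CsdCapabilitySet {
    permitted_ns: HashSet<u32>,
}

impl CsdCapabilitySet {
    pub fn new(ns: &[u32]) -> Self {
        CsdCapabilitySet { permitted_ns: ns.iter().copied().collect() }
    }

    /// Enforced BEFORE hardware submission: a cross-namespace program
    /// (e.g., a join across two datasets) needs capabilities for EVERY
    /// namespace it touches.
    pub fn check_submission(&self, target_ns: &[u32]) -> Result<(), i32> {
        const EPERM: i32 = 1;
        if target_ns.iter().all(|ns| self.permitted_ns.contains(ns)) {
            Ok(())
        } else {
            Err(-EPERM)
        }
    }
}

fn main() {
    let caps = CsdCapabilitySet::new(&[1, 2]);
    assert!(caps.check_submission(&[1]).is_ok());       // within grant
    assert!(caps.check_submission(&[1, 2]).is_ok());    // join: both granted
    assert_eq!(caps.check_submission(&[1, 3]), Err(-1)); // ns 3 not granted
}
```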

33.5 CSD Error Handling

Error scenarios and kernel response:

Timeout (program runs too long):
  1. CSD command timeout (default: 30s, configurable via AccelBase).
  2. Kernel sends NVMe Abort command for the specific command ID.
  3. Returns -ETIMEDOUT to the submitting process.
  4. If abort fails: NVMe controller reset (same path as NVMe I/O timeout).

Hardware error (device reports failure):
  1. CSD returns NVMe status code (e.g., Internal Error, Data Transfer Error).
  2. Kernel maps to errno: -EIO for hardware faults, -ENOMEM for device
     memory exhaustion, -EINVAL for malformed programs.
  3. Error counter incremented in /sys/class/accel/csdN/errors.
  4. If error rate exceeds threshold: driver marks device degraded,
     stops accepting new submissions, notifies userspace via udev event.

Device reset:
  1. NVMe controller reset via PCIe FLR (Function Level Reset).
  2. All in-flight CSD commands are failed with -EIO.
  3. Contexts are invalidated; processes must re-create them.
  4. Same recovery path as standard NVMe timeout handling in Linux.

33.6 Linux Compatibility

NVMe Computational Storage is defined in separate NVMe technical proposals — primarily TP 4091 (Computational Programs) and TP 4131 (Subsystem Local Memory) — not in the NVMe 2.0 base specification. These TPs define the Computational Programs I/O command set and the Subsystem Local Memory I/O command set as independent command sets within the NVMe 2.0 specification library architecture (which separates base spec, command set specs, and transport specs into distinct documents). Linux support is emerging (/dev/ngXnY namespace devices). ISLE supports the same device files and NVMe ioctls through isle-compat.

CSD Programming Model:

CSD programs are opaque command buffers — the kernel does not interpret or compile them. The programming model:

  1. Vendor SDK in userspace: Each CSD vendor provides a userspace SDK that compiles programs for their embedded processor (e.g., Samsung SmartSSD SDK, ScaleFlux CSD SDK).
  2. NVMe TP 4091 (Computational Programs): The NVMe technical proposal defines a standard command set for managing computational programs on CSDs. Programs are uploaded via NVMe admin commands and executed via NVMe I/O commands.
  3. Kernel role: The kernel manages namespace access (capability-gated), DMA buffer allocation (IOMMU-protected), command timeout enforcement, and error reporting. The kernel does NOT validate program correctness — that is the vendor SDK's responsibility.

CSD Data Affinity:

For workloads that benefit from computational storage, data should be placed on the CSD's local namespaces:

  • Filesystem-level routing: Mount a CSD-backed filesystem and place data files on it. CSD compute programs access data locally (no PCIe transfer).
  • Cgroup hint: csd.preferred_device cgroup knob suggests which CSD device should be preferred for new file allocations within that cgroup's processes. Advisory only — the filesystem makes the final placement decision.
  • Explicit placement: Applications using O_DIRECT + the NVMe passthrough interface can target specific CSD namespaces directly.

33.7 Performance Impact

CSD offload reduces host CPU usage and PCIe bandwidth consumption; the gains are largest for scan-heavy workloads where the device can filter or aggregate data before it crosses the bus. Zero overhead when CSDs are not present.


34. SATA/AHCI and Embedded Flash Storage

SATA and eMMC are general-purpose block storage buses present in servers, edge nodes, embedded systems, and consumer devices alike. They belong in the core storage architecture alongside NVMe.

34.1 SATA/AHCI

SATA (Serial ATA) remains widely deployed: HDDs in cold/warm storage tiers, SATA SSDs in cost-sensitive edge nodes, and legacy server hardware. AHCI (Advanced Host Controller Interface) is the standard host-side register interface for SATA controllers.

Driver tier: Tier 1. SATA is a block-latency-sensitive path.

AHCI register interface: The AHCI controller exposes a set of memory-mapped registers (HBA memory space, BAR5) and per-port command list / FIS receive areas. The driver:

  1. Discovers ports via HBA_CAP.NP (number of ports).
  2. For each implemented port: reads PxSIG to identify device type (ATA, ATAPI, PM, SEMB).
  3. Issues IDENTIFY DEVICE (ATA command 0xEC) to retrieve geometry, capabilities, LBA48 support, NCQ depth.
  4. Allocates per-port command list (up to 32 slots) and FIS receive buffer.
  5. Registers the device with isle-block as a BlockDevice with sector size 512 or 4096 (Advanced Format).

Command submission: AHCI uses a memory-based command list. Each command slot contains a Command Table with a Physical Region Descriptor Table (PRDT) for scatter-gather DMA. Native Command Queuing (NCQ, up to 32 outstanding commands) is used when the device reports IDENTIFY.SATA_CAP.NCQ_SUPPORTED.

/// AHCI port state (per-port, Tier 1 driver domain).
pub struct AhciPort {
    /// MMIO base for this port's register set.
    mmio: PortedMmio,
    /// Command list: up to 32 slots, each 32 bytes (AHCI 1.3.1 §4.2.2).
    cmd_list: DmaBox<[AhciCmdHeader; 32]>,
    /// FIS receive area: 256 bytes (AHCI 1.3.1 §4.2.1).
    fis_rx: DmaBox<AhciFisRxArea>,
    /// Per-slot command tables (scatter-gather descriptors).
    cmd_tables: [DmaBox<AhciCmdTable>; 32],
    /// Tracks which command slots are in-flight.
    inflight: AtomicU32,
}
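Claiming a free command slot from the inflight bitmap can be done lock-free with a compare-exchange loop — a sketch, assuming the convention that a set bit means the slot is busy, and masking by the device's reported NCQ depth:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Claim a free command slot (0..=31) from a 32-bit in-flight bitmap.
/// Returns None when every slot the device supports is busy.
pub fn claim_slot(inflight: &AtomicU32, ncq_depth: u32) -> Option<u32> {
    // Mask of usable slots: the low `ncq_depth` bits (32 means all slots).
    let usable = if ncq_depth >= 32 { u32::MAX } else { (1u32 << ncq_depth) - 1 };
    let mut cur = inflight.load(Ordering::Relaxed);
    loop {
        let free = !cur & usable;
        if free == 0 {
            return None;
        }
        let slot = free.trailing_zeros();
        match inflight.compare_exchange_weak(
            cur,
            cur | (1 << slot),
            Ordering::Acquire,
            Ordering::Relaxed,
        ) {
            Ok(_) => return Some(slot),
            // Lost a race with another submitter; retry with fresh state.
            Err(actual) => cur = actual,
        }
    }
}

/// Release a slot after command completion (or abort).
pub fn release_slot(inflight: &AtomicU32, slot: u32) {
    inflight.fetch_and(!(1 << slot), Ordering::Release);
}
```

The real driver would additionally cross-check the claimed slot against the port's PxCI/PxSACT registers before issuing.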

Power management: AHCI supports three interface power states: Active, Partial (≤10 µs wake), and Slumber (≤10 ms wake). The driver uses Aggressive Link Power Management (ALPM) to enter Partial/Slumber when the port is idle. On system suspend (§14a.11), the driver flushes the write cache (FLUSH CACHE EXT, ATA 0xEA) and issues STANDBY IMMEDIATE (ATA 0xE0) before the controller is powered down.

Integration with §29 Block I/O: AHCI ports register as BlockDevice instances with isle-block. The volume layer (§29) treats SATA devices identically to NVMe namespaces — RAID, dm-crypt, dm-verity, thin provisioning all work on SATA block devices without modification.

34.2 eMMC (Embedded MultiMediaCard)

eMMC is a managed NAND flash storage interface used in embedded systems, edge servers with soldered storage, and cost-sensitive devices. The host interface is a parallel bus (up to 8-bit data width) with an MMC command set.

Driver tier: Tier 1 for the MMC host controller; device command processing follows the same ring buffer model as NVMe.

eMMC register interface: The eMMC host controller (typically SDHCI-compatible or vendor-specific) exposes MMIO registers for command/response, data FIFO, and interrupt status. The driver:

  1. Initializes the host controller and negotiates bus width (1/4/8-bit) and speed (HS200/HS400 where supported).
  2. Issues CMD8 (SEND_EXT_CSD) to retrieve the extended CSD register (512 bytes), which contains capacity, supported features, lifetime estimation, and write-protect status.
  3. Registers partitions (boot partitions BP1/BP2, RPMB, user area, general purpose partitions) as separate BlockDevice instances with isle-block.

RPMB (Replay-Protected Memory Block): eMMC RPMB is a hardware-authenticated storage area with replay protection, used for secure credential storage (e.g., TPM secrets, disk encryption keys). Access requires HMAC-SHA256-authenticated commands using a device-specific key programmed once at manufacturing. The kernel exposes RPMB as a capability-gated block device; only processes with the CAP_RPMB_ACCESS capability can issue RPMB commands.

Lifetime and wear: The Extended CSD PRE_EOL_INFO and DEVICE_LIFE_TIME_EST_TYP_A/B fields report device health. The kernel reads these periodically and exposes them via sysfs (/sys/block/mmcblk0/device/life_time). No kernel policy is applied — userspace storage daemons make retention/migration decisions.
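Parsing the capacity and health fields out of the raw 512-byte EXT_CSD buffer might look like this (the byte offsets are the standard JEDEC EXT_CSD offsets; the struct and function names are illustrative):

```rust
/// Byte offsets into the 512-byte EXT_CSD register (per the JEDEC eMMC spec).
const EXT_CSD_SEC_COUNT: usize = 212;           // 4 bytes, little-endian sector count
const EXT_CSD_PRE_EOL_INFO: usize = 267;        // 0x01 normal .. 0x03 urgent
const EXT_CSD_LIFE_TIME_EST_TYP_A: usize = 268; // 0x01 = 0-10% lifetime used, etc.

/// Parsed capacity/health summary from EXT_CSD (illustrative type).
pub struct EmmcInfo {
    pub capacity_bytes: u64,
    pub pre_eol_info: u8,
    pub life_time_est_a: u8,
}

pub fn parse_ext_csd(ext_csd: &[u8; 512]) -> EmmcInfo {
    let sec_count = u32::from_le_bytes([
        ext_csd[EXT_CSD_SEC_COUNT],
        ext_csd[EXT_CSD_SEC_COUNT + 1],
        ext_csd[EXT_CSD_SEC_COUNT + 2],
        ext_csd[EXT_CSD_SEC_COUNT + 3],
    ]);
    EmmcInfo {
        // SEC_COUNT is expressed in 512-byte sectors.
        capacity_bytes: sec_count as u64 * 512,
        pre_eol_info: ext_csd[EXT_CSD_PRE_EOL_INFO],
        life_time_est_a: ext_csd[EXT_CSD_LIFE_TIME_EST_TYP_A],
    }
}
```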

Integration with §29: eMMC user-area partitions register as BlockDevice instances. All §29 volume management targets (dm-crypt, dm-mirror, dm-thin) work on eMMC partitions identically to NVMe namespaces.

34.3 SD Card Reader (SDHCI)

SDHCI (SD Host Controller Interface) is the standard register interface for built-in SD card slot controllers. SD cards register as BlockDevice instances with isle-block.

Driver tier: Tier 1.

Speed mode negotiation: UHS-I (SDR104, 104 MB/s max), UHS-II (312 MB/s), and UHS-III (624 MB/s), negotiated per the SD Association Physical Layer Specification. The driver reads the SD card's OCR, CID, CSD, and SCR registers at initialization to determine supported speed modes and switches the bus to the highest mutually supported mode.
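The "highest mutually supported mode" selection reduces to an intersection-and-max over the host's and card's capability sets — a minimal sketch with illustrative enum names:

```rust
/// SD bus speed modes in ascending order of throughput.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SdSpeedMode {
    DefaultSpeed, // 12.5 MB/s
    HighSpeed,    // 25 MB/s
    Sdr104,       // UHS-I, 104 MB/s
    Uhs2,         // 312 MB/s
    Uhs3,         // 624 MB/s
}

/// Pick the fastest mode both the host controller and the card support.
/// Returns None when the sets do not intersect (caller falls back to
/// DefaultSpeed, which every SD card must support).
pub fn negotiate(host: &[SdSpeedMode], card: &[SdSpeedMode]) -> Option<SdSpeedMode> {
    host.iter()
        .filter(|m| card.contains(*m)) // keep only mutually supported modes
        .copied()
        .max()                          // derive(Ord) follows declaration order
}
```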

Presence detection: SD cards are hot-plug devices. The SDHCI controller raises an interrupt on card insertion/removal. The driver posts a BlockDeviceChanged event to the system event bus (§16b, isle-core) on state change.

Consumer vs. embedded: SD cards appear in consumer laptops (built-in SD slot), embedded systems (primary boot/storage medium), and IoT devices. The SDHCI driver is general-purpose; consumer laptops are simply its most common deployment.


35. Filesystem Drivers: ext4, XFS, and Btrfs

The kernel ships three general-purpose local filesystem drivers. All three implement the FileSystemOps and InodeOps traits defined in §27a (VFS layer). All three are used in server, workstation, embedded, and consumer contexts; they are not consumer-specific.

35.1 ext4

Use cases: Default Linux filesystem. Ubiquitous across servers, containers (overlayfs on ext4), embedded roots, VM images, CI/CD storage nodes, and most existing Linux deployments. ISLE must read/write ext4 volumes from day one for bare-metal Linux migration compatibility.

Tier: Tier 1 (in-kernel driver; no privilege boundary makes sense for a root filesystem that must be available before any domain infrastructure is up).

Journal modes (selected at mount time via data= option):

| Mode | What is journalled | Durability on crash |
|------|--------------------|---------------------|
| data=writeback | Metadata only | Stale data may appear in reallocated blocks |
| data=ordered (default) | Metadata only; data flushed before metadata commit | No stale data |
| data=journal | Metadata and data | Strongest; ~2× write amplification |

ISLE exposes these as mount flags via the FileSystemOps::mount() options string, consistent with Linux behaviour. The VFS durability contract (§27) requires data=ordered or data=journal to satisfy O_SYNC/fsync guarantees; drivers must reject data=writeback if the volume is mounted as a root or journalled data store unless the operator explicitly overrides.
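The mount-time policy described above can be sketched as a small validation step (illustrative names; not the actual FileSystemOps API):

```rust
/// Journal modes for the ext4 `data=` mount option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JournalMode {
    Writeback,
    Ordered,
    Journal,
}

/// Parse the value of the `data=` mount option.
pub fn parse_data_opt(opt: &str) -> Option<JournalMode> {
    match opt {
        "writeback" => Some(JournalMode::Writeback),
        "ordered" => Some(JournalMode::Ordered),
        "journal" => Some(JournalMode::Journal),
        _ => None,
    }
}

/// Enforce the §27 durability contract: data=writeback is rejected for a
/// root or journalled data store unless the operator explicitly overrides.
pub fn validate_journal_mode(
    mode: JournalMode,
    is_root_or_journalled_store: bool,
    operator_override: bool,
) -> Result<(), &'static str> {
    if mode == JournalMode::Writeback && is_root_or_journalled_store && !operator_override {
        return Err("data=writeback violates the durability contract on this volume");
    }
    Ok(())
}
```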

Key features the driver must implement:

- Extents (ext4_extent_tree): 48-bit logical-to-physical mapping via a four-level B-tree embedded in the inode. Supports extents up to 128 MiB contiguous. Replaces the older indirect-block scheme (must also be readable for old volumes without the extents feature flag).
- HTree directory indexing: dir_index feature flag. Directories stored as B-trees keyed by filename hash (half-MD4). Required for directories with more than ~10,000 entries; without it readdir degrades to O(n).
- 64-bit support: 64bit feature flag extends block count from 32 to 48 bits, enabling volumes >16 TiB. Required for modern datacenter deployments; the driver must handle both 32-bit and 64-bit superblocks.
- Inline data: Small files (≤60 bytes) stored directly in the inode body. Important for filesystems hosting millions of tiny files (container layers, npm caches).
- Fast commit (fast_commit feature, Linux 5.10+): Appends a small delta journal entry instead of a full transaction commit for common operations (rename, link, unlink). Reduces journal write amplification by 4–10× for metadata-heavy workloads.

Crash recovery: Replay the ext4 journal (jbd2 compatible format) on mount. The VFS freeze/thaw interface (§27a freeze() / thaw()) is used for consistent snapshots (LVM thin, VM live migration).

Linux compatibility: ISLE's ext4 driver is wire-compatible with Linux's ext4. Volumes formatted with mkfs.ext4 on Linux are mountable by ISLE without conversion. The tune2fs -l feature list (FEATURE_COMPAT, FEATURE_INCOMPAT, FEATURE_RO_COMPAT) governs which features are required vs. optional; the driver rejects mount if any INCOMPAT bit is set that it does not understand.

35.2 XFS

Use cases: Default filesystem on RHEL, CentOS, Fedora, Rocky Linux, and Oracle Linux. Dominant in enterprise storage servers, HPC scratch filesystems, media production storage, and large-scale NFS servers. Designed for very large files and very large directories.

Tier: Tier 1 (same rationale as ext4).

Design:

XFS partitions the volume into allocation groups (AGs), each an independent unit with its own free-space B-trees (bnobt, cntbt), inode B-tree (inobt), and reverse-mapping B-tree (rmapbt, v5 only). Allocation groups enable parallel allocation for multi-threaded workloads — different AGs are independent, so concurrent file creation on different CPUs does not serialize.

Volume layout (simplified):
  [ Superblock | AG 0 | AG 1 | ... | AG N ]
  Each AG: [ AG header | free-space B-trees | inode B-tree | data blocks ]
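The AG-parallelism idea can be illustrated with a toy placement policy (a sketch of the concept only, not XFS's actual rotor/locality algorithm):

```rust
/// Toy allocation-group picker. Directories rotate across AGs to spread
/// load; regular files stay near their parent directory's AG for locality.
/// Because different AGs have independent free-space B-trees, creators on
/// different CPUs that land in different AGs never contend on locks.
pub fn pick_allocation_group(
    cpu_id: u32,
    parent_dir_ag: u32,
    num_ags: u32,
    is_directory: bool,
) -> u32 {
    if is_directory {
        // Spread directory creation across AGs by creating CPU.
        cpu_id % num_ags
    } else {
        // Keep file data near its parent directory.
        parent_dir_ag % num_ags
    }
}
```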

Key features:

- Delayed allocation (delalloc): Blocks are not physically allocated until writeback, allowing the allocator to choose large contiguous extents instead of the first available fragment. Critical for streaming-write performance.
- Speculative preallocation: XFS preallocates beyond the current EOF during sequential writes, then trims unused preallocation on close. Dramatically reduces fragmentation for growing files (logs, databases, media files).
- Reflink (XFS v5, Linux 4.16+): Copy-on-write extent sharing for cheap file copies (same semantic as Btrfs reflinks). Required for efficient container image layering and cp --reflink.
- Reverse mapping B-tree (rmapbt, v5): Tracks which owner (inode or B-tree structure) holds each physical block. Required for online scrub and online repair. Adds ~5% space overhead.
- Real-time device: XFS optionally uses a separate real-time device for files tagged with XFS_XFLAG_REALTIME, guaranteeing allocation from a contiguous extent region. Used in HPC and media production for deterministic I/O latency. ISLE supports the real-time device as a second BlockDevice passed in the mount option rtdev=.
- xattr namespaces: user., trusted., security., system.posix_acl_*. The trusted. namespace is restricted to CAP_SYS_ADMIN; the kernel enforces this via capability checks in setxattr(2).

Journal (xlog): XFS uses a write-ahead log (xlog) for all metadata mutations. The log is circular; the driver replays from the last checkpoint on mount after unclean shutdown. Log can be on the same device (default) or an external device (logdev=) for better write isolation on HDD-based arrays.

Linux compatibility: XFS v5 (superblock version 5, indicated by sb_versionnum) is required for all new volumes. v5 adds a CRC32c checksum to every metadata block, catching silent corruption that ext4 without metadata checksums would miss. ISLE rejects mounting v4 volumes unless a compatibility shim is provided (v4 is deprecated upstream as of Linux 6.x and not worth supporting at launch).

35.3 Btrfs

Use cases: Default filesystem for ISLE desktop/laptop deployments (see consumer roadmap §56.9, 14-roadmap.md; open questions decision), Fedora workstations, Steam Deck. Also used in enterprise for its snapshot and send/receive capabilities (Proxmox, SUSE). Relevant at kernel level wherever atomic snapshots, compression, or multi-device volumes are needed.

Tier: Tier 1.

Design: Btrfs is a copy-on-write (CoW) B-tree filesystem. Every write produces a new copy of the modified data/metadata; the old copy is retained until freed. This is the foundation for snapshots (zero-cost at creation) and atomic multi-file transactions.

Key features:

| Feature | Kernel behaviour |
|---------|------------------|
| Subvolumes | Independent CoW trees within a volume; each mountable separately. The kernel tracks the active subvolume ID per mount point. |
| Snapshots | Read-write or read-only clone of a subvolume at a point in time. Zero-cost creation (no data copied). Used by ISLE live update rollback (§52). |
| Reflinks | Shallow file copy (cp --reflink). Shares extent references until written. Critical for container runtimes and package managers. |
| Transparent compression | Per-file or per-subvolume, online. Algorithms: LZO (fast), ZLIB (balanced), ZSTD (best ratio, default for ISLE). Kernel compresses on writeback; decompresses on read. |
| RAID profiles | RAID 0 / 1 / 1C3 / 1C4 / 5 / 6 / 10 across multiple BlockDevice instances. RAID 5/6 has known write-hole issues (upstream caveat); ISLE documents this and defaults to RAID 1 for redundant volumes. |
| Online scrub | Background verification of all data and metadata checksums. Driven by a kernel thread (btrfs-scrub); progress exposed via ioctl and sysfs. |
| Send/receive | Incremental snapshot delta serialisation. btrfs send produces a stream; btrfs receive applies it on another volume. Used for backup, replication, and container image distribution. |
| Free space tree | v2 free-space cache (B-tree based); replaces the v1 file-based cache. Required for large volumes (>1 TiB); ISLE always mounts with space_cache=v2. |

CoW and O_SYNC interaction: Because Btrfs delays the final tree root update until transaction commit, fsync in the ISLE driver triggers a full transaction commit (not just a data flush) to satisfy durability; Linux Btrfs avoids the full commit in the common case with a dedicated fsync log tree (tree-log). The driver calls btrfs_commit_transaction() on fsync for non-nodatacow files. This is a known latency source for databases; the architecture recommends the nodatacow mount option for database subvolumes (trades crash consistency for performance, consistent with how PostgreSQL and MySQL recommend mounting their data directories on any CoW filesystem).

Live update integration (§52): Btrfs subvolume snapshots can support snapshot-based atomic OS updates. A live update agent can create a read-only snapshot of the root subvolume before applying an update, making rollback trivial and zero-downtime. This makes Btrfs a natural fit for deployments that use snapshot-based atomic updates; on servers where ext4 or XFS is already in use, this advantage does not justify a migration.

Linux compatibility: Btrfs on-disk format is stable since Linux 3.14. ISLE's Btrfs driver is wire-compatible with Linux's. Volumes created on Linux are mountable by ISLE. Feature detection uses the incompat_flags superblock field; the driver rejects mount if any unknown INCOMPAT bit is set.

Limitations documented:

- RAID 5/6: write-hole not fixed upstream as of Linux 6.9. Use RAID 1 or RAID 10 for redundant server volumes.
- nodatacow files cannot have checksums. Applications that disable CoW for performance must accept no data integrity checking on those files.
- Very large directories (>1M entries) have worse performance than XFS due to CoW overhead on directory mutations.

35.4 Removable Media and Interoperability Filesystems

These filesystem drivers serve interoperability with Windows, macOS, and removable media standards. They are not consumer-specific — embedded systems, edge nodes, and industrial devices also use FAT/exFAT/NTFS for removable storage interoperability.

35.4.1 exFAT

Use case: SDXC (SD cards >32 GB) mandates exFAT per the SD Association specification. USB flash drives commonly use exFAT. Required for read/write interop with Windows and macOS systems.

Tier: Tier 1 (in-kernel driver).

Implementation: Native in-kernel driver for the exFAT on-disk format. Microsoft published the exFAT specification in 2019 (no licensing encumbrance). Alternatively, the userspace fuse-exfat driver may be used for initial phases.

Compatibility: Read/write. Files up to 16 EiB. Directory entries use UTF-16. Volume label, timestamps with UTC offset. No journaling (power-loss may corrupt).

35.4.2 NTFS

Use case: External drives shared with Windows installations. Common on USB hard drives purchased pre-formatted.

Tier: Tier 1 (in-kernel ntfs3 driver; merged upstream Linux 5.15).

Implementation: ISLE's ntfs3 driver is based on the upstream Linux ntfs3 implementation contributed by Paragon Software. Provides full read/write support including NTFS compression, sparse files, and hard links. Windows-only features (Alternate Data Streams, reparse points for Windows symlinks) are not supported and return errors on access.

Linux compatibility: Wire-compatible with Linux ntfs3. Volumes created on Linux NTFS are mountable by ISLE.

35.4.3 APFS (Read-Only)

Use case: External drives from macOS systems. Low priority (few Linux/ISLE users have APFS volumes to access).

Tier: Userspace FUSE driver (not in-kernel).

Implementation: Read-only access via FUSE-based driver. No write support (APFS volume metadata format is proprietary; write support risks corruption).