Sections 22–26 of the ISLE Architecture. For the full table of contents, see README.md.
Part VI: Security and Integrity
Verified boot, TPM, integrity measurement, post-quantum cryptography, and confidential computing.
22. Verified Boot Chain
Inspired by: ChromeOS Verified Boot, Android dm-verity, UEFI Secure Boot. IP status: Clean — industry standards (UEFI Secure Boot, TCG TPM), standard cryptographic constructions (Merkle trees, 1979). Public specifications.
22.1 Problem
Section 17 defines the boot sequence but does not address boot integrity. A compromised bootloader or kernel image is the most dangerous attack vector — it undermines all runtime security.
Additionally, in production deployments (cloud, embedded, enterprise), operators need assurance that the running kernel is exactly what they deployed, with no tampering.
22.2 Boot Chain Verification
Firmware (UEFI Secure Boot)
|
| Verifies: GRUB/bootloader signature
v
Bootloader (GRUB2 / systemd-boot)
|
| Verifies: ISLE image signature
v
ISLE Boot Stub (early Rust/asm)
|
| Verifies: initramfs signature
v
ISLE Core Initialization
|
| Verifies: Tier 0 driver integrity (embedded in kernel image)
| Verifies: Tier 1 driver signatures (loaded from initramfs/rootfs)
v
Running System
|
| dm-verity: runtime block-level integrity verification
v
Verified Root Filesystem
Each step verifies the next. A break at any point halts boot (or falls back to a known-good configuration).
22.3 Kernel Image Signing
The ISLE image (vmlinuz-isle-VERSION) is signed during the build process:
// Build-time signature structure appended to kernel image.
#[repr(C)]
pub struct KernelSignature {
    /// Magic: "IKSIG\0\0\0"
    pub magic: [u8; 8],
    /// Signature algorithm ID (matches SignatureAlgorithm enum, Section 25.2):
    /// 0x0001 = Ed25519, 0x0002 = RSA-4096-PSS,
    /// 0x0100 = ML-DSA-44, 0x0101 = ML-DSA-65, 0x0110 = SLH-DSA-128f,
    /// 0x0202 = hybrid Ed25519 + ML-DSA-65.
    /// Algorithm ID 0 is reserved/invalid — a zero value indicates an
    /// unsigned or corrupt image.
    pub algorithm: u32,
    /// Length of actual signature data within the signature buffer.
    pub sig_len: u32,
    /// SHA-256 hash of the unsigned kernel image (informational).
    /// This field records the hash at signing time for offline auditing
    /// and debugging (e.g., comparing against a known-good manifest).
    /// It is NOT used during boot verification — the boot stub computes
    /// a fresh hash (step 3 below) and verifies the signature over that,
    /// eliminating TOCTOU attacks on a stored hash value.
    pub image_hash: [u8; 32],
    /// Signature buffer. Sized for the largest supported algorithm
    /// rounded up to 512-byte alignment:
    /// ML-DSA-65 = 3,309 bytes, SLH-DSA-128f = 17,088 bytes,
    /// hybrid = Ed25519 (64) + ML-DSA-65 (3,309) = 3,373 bytes.
    /// SLH-DSA-128f is the worst case at 17,088 bytes.
    /// Buffer = ceil(17,088 / 512) * 512 = 17,408 bytes.
    /// Only sig_len bytes are meaningful; the rest is zero-padded.
    pub signature: [u8; 17_408],
    /// Public key fingerprint (SHA-256 of the public key).
    pub key_fingerprint: [u8; 32],
}
Early boot verification memory: ML-DSA-65 verification requires approximately 4KB
of scratch memory for NTT polynomial operations and matrix computations. During early
boot (before the slab allocator is initialized), this scratch space is provided by a
statically allocated .bss buffer: static VERIFY_SCRATCH: [u8; 8192]. This
buffer is large enough for both ML-DSA-65 verification (~4KB) and SHA-256 streaming
state (~200 bytes). The boot stub uses this fixed buffer exclusively; it is not used
again after the kernel signature is verified. The slab allocator path (SignatureData::Heap)
is used only for runtime driver signature verification after boot.
Verification flow in the boot stub:
1. Boot stub finds KernelSignature at the end of the image.
2. Reads the public key from:
a. UEFI Secure Boot db (if UEFI boot), OR
b. Embedded in bootloader (if BIOS boot), OR
c. Kernel command line: isle.verify_key=<fingerprint>
(only honored in isle.verify=warn or isle.verify=off modes;
ignored in isle.verify=enforce mode — see security restriction below)
3. Computes SHA-256 of the image (excluding signature) → fresh_hash.
4. Verifies signature over fresh_hash (not over stored image_hash).
This eliminates the TOCTOU window: the hash used for signature
verification is the one just computed, not a stored value an
attacker could tamper with between steps.
5. If verification fails:
a. If isle.verify=enforce (default in production): halt boot.
b. If isle.verify=warn: log warning, continue (development mode).
c. If isle.verify=off: skip (testing only).
Security restriction: In isle.verify=enforce mode, the isle.verify_key
command-line parameter is ignored — the verification key MUST come from the firmware
(UEFI db variable or DTB /chosen/isle,verify-key node), which is itself part of the
measured/verified boot chain. The command-line parameter is only honored in
isle.verify=warn (development) and isle.verify=off (debugging) modes. This prevents
an attacker who controls the boot loader command line from substituting their own
verification key.
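The key-selection rule (step 2) and the failure-handling rule (step 5) can be sketched as follows. This is an illustrative sketch, not the real boot-stub API; the names VerifyMode, select_key, and on_verify_failure are hypothetical.

```rust
// Hypothetical sketch of steps 2 and 5 of the boot-stub flow.
// All names here are illustrative; the real API may differ.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum VerifyMode {
    Enforce, // production default
    Warn,    // development
    Off,     // testing only
}

#[derive(PartialEq, Debug)]
pub enum BootAction {
    Continue,
    Halt,
}

/// Step 2: pick the verification key. The firmware-provided key
/// (UEFI db / DTB) always wins; a key from the kernel command line
/// is honored only outside Enforce mode.
pub fn select_key(
    firmware_key: Option<[u8; 32]>,
    cmdline_key: Option<[u8; 32]>,
    mode: VerifyMode,
) -> Option<[u8; 32]> {
    match (firmware_key, mode) {
        (Some(k), _) => Some(k),
        // In enforce mode the cmdline key is ignored entirely.
        (None, VerifyMode::Enforce) => None,
        (None, _) => cmdline_key,
    }
}

/// Step 5: what to do when signature verification fails.
pub fn on_verify_failure(mode: VerifyMode) -> BootAction {
    match mode {
        VerifyMode::Enforce => BootAction::Halt,
        VerifyMode::Warn | VerifyMode::Off => BootAction::Continue,
    }
}
```

Note that in Enforce mode with no firmware key, select_key returns None and boot halts: the cmdline key never substitutes for a missing firmware key.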
22.4 Driver Signature Verification
Tier 1 drivers are signed (Section 6 mentions "cryptographically signed drivers can be granted Tier 1"). The device registry (Section 7) enforces this during the Loading → Initializing transition.
// In driver ELF binary, `.kabi_sig` section.
#[repr(C)]
pub struct DriverSignature {
    pub magic: [u8; 8], // "NDSIG\0\0\0"
    /// Algorithm IDs: same as KernelSignature.algorithm (SignatureAlgorithm
    /// enum, Section 25.2). Driver signatures typically use ML-DSA-44
    /// (2,420 bytes).
    pub algorithm: u32,
    pub sig_len: u32,
    pub binary_hash: [u8; 32], // SHA-256 of driver ELF (excluding .kabi_sig)
    /// Sized for worst-case algorithm (SLH-DSA-128f = 17,088 bytes,
    /// rounded up to 512-byte alignment = 17,408 bytes). This ensures
    /// the struct is algorithm-agile: the same binary layout works for
    /// any supported signature algorithm without recompilation.
    /// ML-DSA-44 (common case for drivers) uses 2,420 bytes; rest zero-padded.
    /// sig_len indicates how many bytes are meaningful.
    pub signature: [u8; 17_408],
    pub key_fingerprint: [u8; 32],
    pub signer_name: [u8; 64], // Human-readable signer identity
}
Policy (integrates with existing tier trust model):
| Verification | Tier 0 | Tier 1 | Tier 2 |
|---|---|---|---|
| Signature required? | Always (part of kernel image) | Configurable (default: required) | Optional |
| Unsigned behavior | Cannot exist unsigned | Demoted to Tier 2 | Allowed (user-space isolation) |
| Tampered binary | Kernel does not boot | Loading rejected, alert | Loading rejected, alert |
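The table can be read as a small decision function applied during the Loading → Initializing transition. A sketch under stated assumptions: Tier, LoadDecision, and the flag names are hypothetical, and Tier 0 never reaches this path because it is part of the kernel image.

```rust
// Sketch of the load-time policy from the table above.
// Names are illustrative, not the registry's real API.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Tier {
    Tier1,
    Tier2,
}

#[derive(PartialEq, Debug)]
pub enum LoadDecision {
    Load(Tier),
    DemoteToTier2, // unsigned Tier 1 driver while signatures are required
    Reject,        // tampered binary: refuse and alert
}

pub fn driver_load_decision(
    tier: Tier,
    signed: bool,
    signature_valid: bool,
    require_tier1_signature: bool, // configurable, default true
) -> LoadDecision {
    match (tier, signed) {
        // A present-but-invalid signature means tampering at any tier.
        (_, true) if !signature_valid => LoadDecision::Reject,
        (Tier::Tier1, false) if require_tier1_signature => LoadDecision::DemoteToTier2,
        (t, _) => LoadDecision::Load(t),
    }
}
```

Demotion rather than rejection for unsigned Tier 1 drivers preserves availability: the driver still runs, but in user-space isolation.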
22.5 Runtime Filesystem Integrity (dm-verity)
For production deployments, the root filesystem can be verified on every block read using Merkle tree verification. This detects tampering of any file at read time.
Root Filesystem Device Hash Tree Device
+------------------+ +------------------+
| Block 0 | --hash--> | Leaf hash 0 |
| Block 1 | --hash--> | Leaf hash 1 |
| Block 2 | --hash--> | Leaf hash 2 |
| Block 3 | --hash--> | Leaf hash 3 |
| ... | | ... |
+------------------+ +------------------+
|
+-----+-----+
| |
Node 0-1 Node 2-3
| |
+-----+-----+
|
Root Hash
(signed, stored
in kernel cmdline
or bootloader)
Implementation in ISLE Block I/O layer:
// isle-block/src/verity.rs

/// Maximum supported hash tree depth. A depth of 20 with 4096-byte
/// hash blocks covers 4096^20 blocks, far exceeding any physical device.
const VERITY_MAX_TREE_DEPTH: usize = 20;

pub struct VerityTarget {
    /// Underlying data device.
    data_device: BlockDeviceHandle,
    /// Hash tree device (can be same device, appended).
    hash_device: BlockDeviceHandle,
    /// Hash algorithm: SHA-256 (default).
    hash_algorithm: HashAlgorithm,
    /// Block size for hashing (default: 4096).
    hash_block_size: u32,
    /// Root hash (trusted, from boot cmdline or signature).
    root_hash: [u8; 32],
    /// Hash tree depth (computed from device size).
    tree_depth: u32,
    /// Pre-computed hash tree level offsets.
    /// A typical dm-verity device has at most ~20 hash tree levels
    /// (a 20-level tree covers 4096^20 blocks, far exceeding any
    /// physical device). A fixed-size array avoids heap allocation
    /// entirely, which is essential during early boot before the
    /// general-purpose allocator is initialized.
    level_offsets: [u64; VERITY_MAX_TREE_DEPTH],
    /// Number of valid entries in level_offsets (always <= VERITY_MAX_TREE_DEPTH).
    level_count: u32,
    /// Error behavior on verification failure.
    error_behavior: VerityErrorBehavior,
    /// Cache of verified hashes (avoids re-hashing on repeated reads).
    /// None until the slab allocator is online; verification operates
    /// without caching. Initialized to Some(...) during block subsystem
    /// init.
    hash_cache: Option<LruCache<u64, [u8; 32]>>,
}

#[repr(u32)]
pub enum VerityErrorBehavior {
    /// Return -EIO for the failed block (default).
    Eio = 0,
    /// Kernel panic (for security-critical deployments).
    Panic = 1,
    /// Log and continue (for debugging).
    Ignore = 2,
}
Activation via kernel command line (standard dm-verity syntax):
root=/dev/dm-0
dm-mod.create="isle-root,,,ro,0 <size> verity 1 /dev/sda1 /dev/sda2 4096 4096 <num_data_blocks> <hash_start_block> sha256 <root_hash> <salt>"
Or via the device-mapper ioctl interface (DM_TABLE_LOAD), which veritysetup from
cryptsetup uses. Existing tools work unmodified.
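The per-read check walks the tree bottom-up: hash the data block, compare against the matching entry in its parent hash block, then hash the parent and compare upward until the signed root. A two-level sketch, using a toy placeholder hash (an FNV-style fold, NOT cryptographic — real dm-verity uses SHA-256) so the example is self-contained:

```rust
// Toy stand-in for SHA-256 so the sketch is self-contained.
// Placeholder only: an FNV-1a fold widened to 32 bytes.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut acc: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        acc = (acc ^ b as u64).wrapping_mul(0x100_0000_01b3);
    }
    let mut out = [0u8; 32];
    out[..8].copy_from_slice(&acc.to_le_bytes());
    out
}

/// Verify one data block against a two-level hash tree: its hash must
/// match the entry in the parent hash block, and the parent hash
/// block's hash must match the trusted (signed) root.
pub fn verify_block(
    block: &[u8],
    parent_hash_block: &[u8], // concatenated 32-byte leaf hashes
    leaf_index: usize,
    root_hash: &[u8; 32],
) -> bool {
    let entry = &parent_hash_block[leaf_index * 32..(leaf_index + 1) * 32];
    entry == toy_hash(block).as_slice() && toy_hash(parent_hash_block) == *root_hash
}
```

A deeper device simply repeats the middle step once per tree level; the hash_cache in VerityTarget lets already-verified interior nodes short-circuit the walk.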
22.6 Key Revocation
Driver signature verification checks against a Key Revocation List (KRL) to handle compromised signing keys:
- The KRL is a signed, append-only list of revoked key fingerprints, embedded in the initramfs and updated with each kernel/initramfs build.
- On boot, the KRL is loaded and checked before any driver loading. Any driver signed with a revoked key is rejected, regardless of tier.
- At runtime, a new KRL can be loaded via /proc/isle/security/krl (requires admin capability). This allows revoking keys without rebooting.
- UEFI Secure Boot dbx provides bootloader/kernel-level revocation (existing standard mechanism). The KRL extends this to cover driver-level revocation.
// Kernel-internal

/// Key revocation list.
///
/// **Lifetime management**: Both the `KeyRevocationList` struct and the
/// `revoked_keys` array it points to are allocated together as a single
/// slab allocation (struct header + trailing key array). Boot-time KRLs
/// are bump-allocated and have true `'static` lifetime. Runtime-loaded
/// KRLs are slab-allocated and managed via RCU: the old KRL (struct +
/// key array) remains valid until all readers complete their RCU
/// read-side critical section, then the entire slab slot is freed.
/// Readers must access the struct and its `revoked_keys` only within
/// `rcu_read_lock()` / `rcu_read_unlock()`.
#[repr(C)]
pub struct KeyRevocationList {
    /// Signature over the KRL itself (prevents tampering).
    /// Sized for worst-case PQC signature (SLH-DSA-128f = 17,088 bytes,
    /// rounded up to 512-byte alignment = 17,408 bytes).
    pub signature: [u8; 17_408],
    /// Actual length of the signature in `signature[]`.
    /// Algorithms produce different-length signatures (Ed25519: 64 bytes,
    /// ML-DSA-65: 3,309 bytes, SLH-DSA-128f: 17,088 bytes).
    pub sig_len: u32,
    /// Algorithm hint for this KRL's signature. Matches `SignatureAlgorithm`
    /// enum (Section 25.2). This field is a parser optimization hint only —
    /// the verifier MUST NOT trust it as authoritative. The actual verification
    /// algorithm is determined by the signing key: each public key in the
    /// kernel's trusted keyring has a fixed algorithm, and the verifier enforces
    /// that `algorithm` matches the key's algorithm. A mismatch causes
    /// verification failure (`-EKEYREJECTED`), preventing algorithm-confusion
    /// attacks where an attacker specifies a weaker algorithm in a forged KRL.
    pub algorithm: u32,
    /// Version (monotonically increasing, prevents rollback).
    /// Anti-rollback enforcement: on every KRL load, the kernel
    /// compares this version against the last-seen KRL version stored
    /// in TPM NV index 0x01C10200 (a monotonic counter). If the new
    /// version is less than the stored value, the KRL is rejected
    /// (prevents an attacker from replaying an older KRL with fewer
    /// revocations). On successful load, the TPM NV counter is updated
    /// to the new version. On systems without a TPM, the last-seen
    /// version is stored in a UEFI authenticated variable
    /// (isle-krl-version); this is weaker (vulnerable to firmware-level
    /// attacks) but still prevents userspace rollback.
    pub version: u64,
    /// Public key fingerprints (SHA-256 hash of the raw public key bytes),
    /// sorted for binary search. Algorithm-agnostic: the fingerprint is a
    /// SHA-256 hash regardless of whether the key is Ed25519, ML-DSA-44,
    /// ML-DSA-65, or any future algorithm — all produce a 32-byte fingerprint.
    /// Uses `RcuSlice<[u8; 32]>` — a zero-cost wrapper around the raw pointer
    /// that can only be dereferenced within an `RcuReadGuard` scope, preventing
    /// use-after-free if the backing slab allocation is freed after a grace period.
    /// Boot-time KRLs are bump-allocated (effectively 'static); runtime-loaded
    /// KRLs are slab-allocated and freed after an RCU grace period.
    /// Length is given by `revoked_count`.
    pub revoked_keys: RcuSlice<[u8; 32]>,
    /// Number of entries pointed to by `revoked_keys`.
    /// Required because `RcuSlice` does not carry length information.
    pub revoked_count: u32,
    /// SHA-256 hash of the serialized KRL (for TPM PCR extend).
    pub digest: [u8; 32],
}
/// **Lifetime safety**: The `revoked_keys` field uses `RcuSlice<[u8; 32]>`,
/// a zero-cost wrapper around `*const [u8; 32]` that implements `Deref` only
/// when an `RcuReadGuard` is in scope. `RcuSlice` does NOT implement `Send`
/// or `Sync` directly — the containing `KeyRevocationList` is shared via
/// `RcuCell<KeyRevocationList>`, which provides the necessary lifetime
/// guarantees (readers hold `RcuReadGuard`, preventing grace period completion
/// and slab freeing). The blanket `unsafe impl Send/Sync` on `KeyRevocationList`
/// is safe because the struct is only accessible through `RcuCell`, never
/// directly shared.
// SAFETY: `revoked_keys` is an `RcuSlice` whose pointee remains valid for the
// lifetime of any `RcuReadGuard` that can reach this struct. The struct is
// read-only after construction (immutable once published via RCU). Access
// outside an RCU read-side critical section is prevented by `RcuSlice`'s API,
// making cross-thread send and share safe.
unsafe impl Send for KeyRevocationList {}
unsafe impl Sync for KeyRevocationList {}
The verification flow during driver loading:
1. Extract key_fingerprint from DriverSignature.
2. Binary search KRL for the fingerprint.
3. If found: reject driver load with -EKEYREVOKED, emit alert.
4. If not found: proceed with normal signature verification.
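Steps 2-4 reduce to a binary search over the sorted fingerprint array; a sketch follows. The errno constant mirrors Linux's EKEYREVOKED and is an assumption here, and the plain slice stands in for the RCU-protected revoked_keys array (real callers would hold an RCU read guard around the call).

```rust
// Sketch of the KRL check on driver load (names illustrative).

const EKEYREVOKED: i32 = 128; // assumed errno value (as on Linux)

/// Steps 2-4: reject if the signer's key fingerprint is revoked,
/// otherwise proceed to normal signature verification.
pub fn check_revocation(
    revoked_keys: &[[u8; 32]], // sorted ascending
    key_fingerprint: &[u8; 32],
) -> Result<(), i32> {
    match revoked_keys.binary_search(key_fingerprint) {
        Ok(_) => Err(-EKEYREVOKED), // step 3: reject, emit alert (omitted)
        Err(_) => Ok(()),           // step 4: not revoked, verify signature
    }
}
```

The O(log n) search with no blocking operations is what keeps the RCU read-side critical section short, which Section 22.6.1 depends on.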
22.6.1 RCU Read-Side Timeout (DoS Mitigation)
Attack vector: RCU-based KRL management has a theoretical denial-of-service vulnerability. An attacker could keep an RCU read-side critical section open indefinitely by repeatedly calling a syscall that reads from the KRL (e.g., driver loading queries) without allowing the critical section to exit. This would prevent the RCU grace period from ever completing, blocking KRL updates and preventing revocation from taking effect.
Mitigation: Preemptible RCU with priority boosting. ISLE uses preemptible RCU for
KRL read-side critical sections — readers can be preempted by higher-priority tasks,
preventing a single reader from indefinitely blocking grace period completion. When a
grace period has been pending for longer than KRL_RCU_BOOST_MS (default: 1ms), the
RCU subsystem boosts the priority of all in-progress KRL readers to SCHED_FIFO
priority 99, expediting their completion. This is the same mechanism used by Linux's
CONFIG_RCU_BOOST — it respects RCU semantics (readers complete naturally rather than
being forcibly terminated) and avoids the use-after-free risk of forcibly calling
rcu_read_unlock() from outside the critical section.
Implementation notes:
- KRL readers are expected to be brief: a binary search over the revoked_keys array is
O(log n) and does not perform blocking operations. Any syscall that blocks while
holding KRL RCU read lock is a bug.
- Priority boosting is a safety net, not a normal-path mechanism. Under normal operation,
KRL lookups complete in <10μs — well before the 1ms boost threshold.
- On boost trigger, the kernel logs a warning with the CPU ID, task PID, and execution
context. Repeated boosts from the same task trigger rate limiting (log suppression) to
avoid log flooding.
- Syscall rate limiting: The driver-load syscall (isle_driver_load) is rate-limited
to at most 10 invocations per second per UID (using a per-UID token bucket). This
prevents a malicious userspace process from abusing repeated driver load requests to
keep KRL RCU read-side critical sections active across many CPUs and delay grace period
completion. Excess requests fail with -EAGAIN. This is the primary defense; priority
boosting is the secondary safety net.
- This mitigation is defensive: a compromised kernel (or a bug in a Tier 1 driver) could
bypass it, but that is outside the threat model. The RCU boost + rate limiting protects
against userspace-triggered DoS only.
23. TPM Runtime Services
23.1 Measured Boot (TPM)
For environments with TPM (Trusted Platform Module):
1. UEFI firmware measures itself into TPM PCR 0-7 (standard, not our concern).
2. Bootloader measures kernel image into TPM PCR 8.
3. ISLE Core boot stub measures:
- Kernel command line → PCR 8
- Initramfs image → PCR 9
PCR 10 is reserved for IMA runtime measurements (Section 24) and
is NOT extended during boot. This avoids collisions with Linux IMA,
which uses PCR 10 by default — a verifier must be able to distinguish
boot-time measurements from runtime measurements.
4. ISLE Core measures loaded Tier 1 drivers → PCR 11.
5. Remote attestation server can verify the full boot chain via TPM quotes.
This is additive — TPM integration is optional and does not affect the boot path for systems without TPM.
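All of these measurements rest on one primitive: a PCR can only be extended, never set, via new = H(old || digest), so a verifier replaying the event log can recompute the final value. A sketch with a toy placeholder hash (NOT cryptographic; TPM PCR banks use SHA-1/SHA-256):

```rust
// Toy 32-byte stand-in for SHA-256 (placeholder, not cryptographic).
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut acc: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        acc = (acc ^ b as u64).wrapping_mul(0x100_0000_01b3);
    }
    let mut out = [0u8; 32];
    out[..8].copy_from_slice(&acc.to_le_bytes());
    out
}

/// A PCR starts at all zeroes at power-on and can only be extended.
pub struct Pcr(pub [u8; 32]);

impl Pcr {
    pub fn new() -> Self {
        Pcr([0u8; 32])
    }

    /// new = H(old || digest). Order matters: extending the kernel
    /// hash then the initramfs hash yields a different value than the
    /// reverse, so the verifier checks the whole measurement sequence.
    pub fn extend(&mut self, digest: &[u8; 32]) {
        let mut buf = [0u8; 64];
        buf[..32].copy_from_slice(&self.0);
        buf[32..].copy_from_slice(digest);
        self.0 = toy_hash(&buf);
    }
}
```

Because extension is one-way, malicious code that runs after a measurement cannot erase its own digest from the chain; it can only extend further.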
23.2 TPM Runtime Services
Section 23.1 covers TPM as a boot measurement device. But TPM 2.0 is also a runtime crypto engine — a hardware security module built into every modern server and laptop. ISLE integrates TPM as a first-class runtime resource.
Linux TPM interfaces — Linux exposes /dev/tpm0 (raw character device for direct
TPM command submission) and /dev/tpmrm0 (resource-managed access that multiplexes
TPM sessions). ISLE provides both via the syscall interface for existing userspace
tools (tpm2-tools, clevis, systemd-cryptenroll).
Key sealing and unsealing — TPM can seal secrets (encryption keys, credentials) to
a specific PCR state. The secret can only be recovered if the PCRs match — meaning the
system booted the expected kernel, with the expected drivers, in the expected
configuration. If any component in the boot chain is modified or compromised, unsealing
fails and the secret is protected. ISLE integrates sealing with its capability system:
the CAP_TPM_SEAL capability is required to create sealed objects, and the resulting
sealed blob is itself a capability-protected kernel object.
NV Indices (NVRAM) — TPM provides persistent, tamper-resistant storage called NV indices. These store small secrets (disk encryption keys, network credentials, device identity certificates) that survive reboots and are protected by TPM authorization policies. ISLE abstracts NV indices as capability-protected objects — reading/writing an NV index requires the appropriate capability token, and the authorization policy (password, HMAC, or PCR-bound) is enforced by the TPM hardware itself.
Hardware random number generator — TPM includes a hardware RNG (certified to NIST
SP 800-90A). ISLE's entropy pool (which also draws from CPU RDRAND/RDSEED) mixes in
TPM random data when a TPM is present. This provides defense-in-depth: if the CPU RNG is
compromised (as has happened with certain microcode bugs), the TPM RNG provides
independent entropy.
Authorization policies — TPM 2.0 supports rich authorization:
- PCR-bound: seal/unseal only if PCRs match expected values (boot integrity)
- Password-based: simple passphrase authorization
- HMAC-based: proof of possession of a shared secret
- Policy-based: compound policies combining PCR state, time-of-day, locality, NV counter values, and external authorization (e.g., "unseal only if PCR 11 matches AND a remote server approves")
ISLE's policy engine maps these to capability gates — a sealed key's authorization policy determines which capability holders can trigger unsealing.
TPM-backed disk encryption — dm-crypt volume keys sealed to TPM PCR state. On boot, ISLE's initramfs unseals the volume key from the TPM (no passphrase required if the boot chain is trusted). This replaces the Linux approach of requiring systemd-cryptenroll or clevis daemons with kernel-native TPM key management. If the boot chain changes (different kernel version, modified initramfs), the TPM refuses to unseal and the system falls back to passphrase entry.
Resource manager — TPM 2.0 has limited internal session/object slots. Linux requires either the in-kernel TPM resource manager (exposed as /dev/tpmrm0) or the userspace tpm2-abrmd daemon to multiplex access. ISLE's TPM driver includes a kernel-native resource manager that transparently handles context swapping, so multiple concurrent TPM users (disk encryption, IMA measurements, remote attestation, application key storage) never contend for TPM slots. No userspace daemon is required.
Performance considerations — TPM operations are inherently slow (~5-50ms per command, depending on operation — RSA operations are the slowest). ISLE's async I/O model ensures TPM operations never block the caller synchronously. TPM commands are submitted via the ring buffer interface and completed asynchronously. For latency-sensitive paths (e.g., IMA measurement during file open), ISLE caches measurement results and only re-measures when file content changes.
24. Runtime Integrity Measurement (IMA)
Measured boot (Section 23) ensures the system booted a trusted kernel and drivers. But what about runtime — every binary executed, every library loaded, every config file read after boot? Runtime integrity measurement extends the trust chain from boot into ongoing operation.
Linux IMA measures files into TPM PCR 10 (or a designated PCR) at access time. It was bolted onto the VFS layer after the fact, with a standalone policy language and limited integration with the rest of the security stack. ISLE provides equivalent functionality, deeply integrated with the capability system and driver loading.
Measurement policy — Rules specify what to measure:
- All executable files (mmap PROT_EXEC)
- All shared libraries (dlopen)
- Specific configuration files (e.g., /etc/fstab, /etc/passwd)
- All files opened by processes holding specific capabilities
- All files in specific filesystem subtrees
Policy rules are expressed as capability predicates — "measure all files opened by any
process holding CAP_NET_ADMIN" — making the policy composable with ISLE's existing
security architecture.
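Rules-as-predicates can be sketched as a small match over access events; FileEvent, Rule, and should_measure are hypothetical names, and capabilities are represented as strings purely for illustration.

```rust
// Measurement policy as composable predicates (illustrative sketch).

pub struct FileEvent<'a> {
    pub path: &'a str,
    pub exec_mapping: bool,         // mmap PROT_EXEC / dlopen
    pub holder_caps: &'a [&'a str], // capabilities held by the opener
}

pub enum Rule {
    AllExecutables,
    Subtree(&'static str),
    HolderHasCap(&'static str),
}

fn rule_matches(rule: &Rule, ev: &FileEvent) -> bool {
    match rule {
        Rule::AllExecutables => ev.exec_mapping,
        Rule::Subtree(prefix) => ev.path.starts_with(prefix),
        Rule::HolderHasCap(cap) => ev.holder_caps.iter().any(|c| c == cap),
    }
}

/// A file is measured if any policy rule matches the access.
pub fn should_measure(policy: &[Rule], ev: &FileEvent) -> bool {
    policy.iter().any(|r| rule_matches(r, ev))
}
```

Composition falls out for free: "measure all files opened by any process holding CAP_NET_ADMIN" is just one more Rule in the policy slice.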
IMA appraisal (enforcement) — Beyond measurement (recording what was accessed), IMA appraisal enforces integrity. Each file has a signed hash (stored as an extended attribute or in a separate manifest). On access, the kernel computes the file's hash, verifies it against the signed reference, and refuses to execute/load the file if they don't match. This blocks tampered binaries at runtime, not just at audit time.
IMA and ISLE driver loading — Every Tier 1 driver load is already measured into PCR 11 (Section 23). IMA extends this to:
- Tier 2 driver binaries (userspace drivers loaded via KABI)
- Userspace helper executables invoked by drivers
- Firmware blobs loaded by drivers
- Configuration files that affect driver behavior
Audit log — All measurements are logged to ISLE's audit subsystem. The log is tamper-evident (hash-chained) and optionally signed. A remote attestation server can request a TPM quote over the measurement PCR plus the audit log, verifying not just "the system booted correctly" but "the system has only executed known-good software since boot."
EVM (Extended Verification Module) — Protects file metadata (permissions,
ownership, extended attributes, ACLs) against offline tampering. Without EVM, an
attacker with physical access could mount the disk on another system and modify file
permissions (e.g., make /usr/bin/su setuid-root) without changing file contents.
EVM stores an HMAC over file metadata as an extended attribute, verified on every access.
The HMAC key is sealed to the TPM, so offline modification is detectable.
Performance impact — IMA measurement cost is dominated by SHA-256 hashing of
file contents and scales linearly with file size. For typical small files (config
files, shared libraries <1 MB): ~10-50 μs. For large binaries (100 MB): ~200 ms
at ~500 MB/s single-threaded SHA-256 throughput. The measurement cache (below)
amortizes this cost — after the first measurement, re-measurement occurs only on
content change, so the large-file cost is paid once. Mitigations:
- Measurement cache: after first measurement, the result is cached. Re-measurement
occurs only if the file's content changes (detected via inode version counter)
- Policy exemptions: transient paths (/tmp, /dev, /proc, /sys) are
configurable exemptions — no point measuring pseudo-filesystems
- Async measurement: for large files, measurement can be performed asynchronously
during non-security-critical reads (e.g., data files opened for read-only access).
However, security-critical paths are always synchronous: any file opened for
execution (mmap PROT_EXEC, dlopen, driver loading) is measured synchronously
before the content is used. This prevents a TOCTOU vulnerability where file content
could change between measurement and execution. The synchronous path holds an
exclusive file lock (preventing concurrent writes) while computing the hash and
comparing it against the signed reference. The async path is used only for
audit/measurement-only policy rules where appraisal enforcement is not required
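The measurement cache keyed by inode version (first mitigation above) can be sketched as follows. A std HashMap stands in for whatever kernel-side structure is actually used, and all names are illustrative.

```rust
// Cache sketch: re-measure only when the inode's version counter
// changes. Names and the HashMap backing are illustrative.
use std::collections::HashMap;

pub struct MeasureCache {
    // inode number -> (version at measurement time, digest)
    entries: HashMap<u64, (u64, [u8; 32])>,
}

impl MeasureCache {
    pub fn new() -> Self {
        MeasureCache { entries: HashMap::new() }
    }

    /// Returns (digest, was_cache_hit). Runs `measure` only when
    /// there is no entry or the stored version is stale.
    pub fn lookup_or_measure(
        &mut self,
        inode: u64,
        version: u64,
        measure: impl FnOnce() -> [u8; 32],
    ) -> ([u8; 32], bool) {
        if let Some(&(v, d)) = self.entries.get(&inode) {
            if v == version {
                return (d, true); // hit: no re-hash
            }
        }
        let d = measure();
        self.entries.insert(inode, (version, d));
        (d, false)
    }
}
```

Any write to the file bumps the inode version counter, so the next access falls through to a fresh measurement.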
Integration with dm-verity — For read-only filesystems (container images, system partitions), dm-verity (Section 22.5) provides block-level integrity verification. IMA is complementary: it covers read-write filesystems where dm-verity cannot apply (dm-verity requires immutable block devices). Together, they provide complete coverage: dm-verity for immutable system images, IMA for mutable data and user-installed software.
25. Post-Quantum Cryptography
25.1 Why This Cannot Wait
NIST finalized post-quantum cryptography standards in 2024:
- ML-KEM (Kyber): key encapsulation (replaces ECDH/RSA key exchange)
- ML-DSA (Dilithium): digital signatures (replaces RSA/Ed25519)
- SLH-DSA (SPHINCS+): hash-based signatures (stateless, conservative)
"Harvest now, decrypt later" attacks mean data encrypted today with classical algorithms is vulnerable to future quantum computers. Migration timelines are 5-10 years — starting now is not optional.
ISLE uses cryptography in:
| Component | Current Algorithm | PQC Replacement | Section |
|---|---|---|---|
| Verified boot (kernel signature) | RSA-4096 / Ed25519 | ML-DSA-65 or SLH-DSA-128f | Section 22.3 |
| Driver signatures | Ed25519 | ML-DSA-44 | Section 22.4 |
| dm-verity Merkle tree | SHA-256 | SHA-256 (quantum-safe as-is) | Section 22.5 |
| Distributed capabilities | Ed25519 | ML-DSA-44 | Section 47.9 |
| Cluster node authentication | X25519 key exchange + Ed25519 auth | ML-KEM-768 + ML-DSA-65 | Section 47.2 |
| TPM PCR measurements | SHA-256 | SHA-256 (quantum-safe) | Section 23 |
25.2 Design: Algorithm-Agile Crypto Abstraction
The critical design requirement: never hardcode key/signature sizes. PQC signatures are much larger than classical:
| Algorithm | Signature Size | Public Key Size |
|---|---|---|
| Ed25519 (current) | 64 bytes | 32 bytes |
| ML-DSA-44 (PQC) | 2,420 bytes | 1,312 bytes |
| ML-DSA-65 (PQC) | 3,309 bytes | 1,952 bytes |
| SLH-DSA-128f (PQC) | 17,088 bytes | 32 bytes |
Algorithm selection criteria:
- ML-DSA (default): Lattice-based. Fast signing/verification, moderate signature size (~2.4-3.3KB). Use for: driver signatures, capabilities, cluster authentication. This is the standard choice unless there is a specific reason to avoid lattice assumptions.
- SLH-DSA (paranoid mode): Hash-based (stateless). Conservative — survives even if lattice mathematical assumptions are broken by future cryptanalysis. Huge signatures (~17KB). Use for: kernel image signatures (verified once at boot, size doesn't matter). Configurable: isle.crypto.boot_algorithm=slh-dsa-128f overrides the default ML-DSA-65.
- Hybrid mode: Both Ed25519 + ML-DSA. Use during transition period (2025-2035). Both must verify. Provides defense if either algorithm family is broken.
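For hybrid mode, the composite signature is the two component signatures concatenated, and verification requires both to pass. A sketch for Ed25519 + ML-DSA-65, with illustrative names and closures standing in for the real verifiers:

```rust
// Hybrid Ed25519 + ML-DSA-65: component sizes per Section 25.2.
const ED25519_SIG_LEN: usize = 64;
const ML_DSA_65_SIG_LEN: usize = 3_309;

/// Split a 3,373-byte hybrid signature into its components.
pub fn split_hybrid(sig: &[u8]) -> Option<(&[u8], &[u8])> {
    if sig.len() != ED25519_SIG_LEN + ML_DSA_65_SIG_LEN {
        return None;
    }
    Some(sig.split_at(ED25519_SIG_LEN))
}

/// Both component signatures must verify: breaking either algorithm
/// family alone is not enough to forge a hybrid signature.
pub fn hybrid_verify(
    sig: &[u8],
    verify_ed25519: impl Fn(&[u8]) -> bool,
    verify_ml_dsa: impl Fn(&[u8]) -> bool,
) -> bool {
    match split_hybrid(sig) {
        Some((classical, pqc)) => verify_ed25519(classical) && verify_ml_dsa(pqc),
        None => false,
    }
}
```

The AND combination is what gives the transition-period guarantee: an attacker must break both Ed25519 and ML-DSA-65 simultaneously.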
If capability tokens (Section 47.9) have a fixed 64-byte signature field, PQC won't fit. The signature field must be variable-length or large enough for the biggest PQC algorithm.
// isle-core/src/crypto/mod.rs
/// Signature algorithm identifier.
/// Used for verified boot, driver signatures, capabilities, and
/// any context where digital signatures are created or verified.
/// Signatures and KEMs are separate enums because they serve
/// fundamentally different purposes: signatures prove authenticity
/// (sign/verify over a message), while KEMs establish shared secrets
/// (encapsulate/decapsulate). They have different field requirements
/// (signature data vs. ciphertext/shared secret) and must not be
/// conflated in data structures.
#[repr(u32)]
pub enum SignatureAlgorithm {
    // === Classical (pre-quantum) ===
    Ed25519 = 0x0001,
    Rsa4096Pss = 0x0002,

    // === Post-quantum (NIST standards) ===
    MlDsa44 = 0x0100,    // FIPS 204, security level 2
    MlDsa65 = 0x0101,    // FIPS 204, security level 3
    MlDsa87 = 0x0102,    // FIPS 204, security level 5
    SlhDsa128f = 0x0110, // FIPS 205, fast variant
    SlhDsa128s = 0x0111, // FIPS 205, small variant

    // === Hybrid (classical + PQC, transition period) ===
    Ed25519PlusMlDsa44 = 0x0200, // Both signatures, both must verify
    Rsa4096PlusMlDsa65 = 0x0201,
    Ed25519PlusMlDsa65 = 0x0202, // Ed25519 + ML-DSA-65 (kernel image default)
}
/// Key Encapsulation Mechanism (KEM) algorithm identifier.
/// Used for key exchange in cluster node authentication (Section 47.2)
/// and any context where shared secrets are established.
/// Separate from SignatureAlgorithm because KEMs produce
/// (ciphertext, shared_secret) pairs, not signatures.
#[repr(u32)]
pub enum KemAlgorithm {
    // === Post-quantum (NIST standards) ===
    MlKem768 = 0x0120,  // FIPS 203, security level 3
    MlKem1024 = 0x0121, // FIPS 203, security level 5
}
// Note: Hybrid signature algorithm IDs start at 0x0200, which exceeds u8 range.
// Network-portable structures (e.g., DistributedCapability in Section 47.9)
// that carry a sig_algorithm field MUST use at least u16 (or the full u32
// encoding) to represent the complete SignatureAlgorithm ID space.
// Cross-reference: DistributedCapability.sig_algorithm in Section 47.9
// must be widened from u8 to at least u16 to accommodate hybrid algorithms.
/// Variable-length signature.
/// Avoids hardcoding signature size in data structures.
pub struct Signature {
    /// Algorithm that produced this signature.
    pub algorithm: SignatureAlgorithm,
    /// Signature bytes (length depends on algorithm).
    pub data: SignatureData,
}
/// Signature data — inline for classical (hot path), boxed for PQC (cold path).
pub enum SignatureData {
/// Ed25519: 64 bytes. Fits inline. No allocation.
/// Used in capability verification cache (hot-path lookups).
Inline64([u8; 64]),
/// PQC signatures: heap-allocated via Box<[u8]>.
/// PQC signatures are 2.4KB–17KB. Heap allocation is acceptable
/// because PQC signing/verification occurs only on cold paths
/// (boot, driver load, capability creation, cluster join), all of
/// which run after the kernel slab allocator is initialized
/// (post-Phase 2 boot, Section 12.2). During early boot (before
/// the heap is available), signature verification uses the
/// fixed-size buffers in KernelSignature and DriverSignature
/// structs directly, bypassing this type entirely.
Heap(Box<[u8]>),
}
// Note on cache pressure: cached PQC capabilities are ~2.5KB per entry
// vs 128 bytes for Ed25519. For 1,000 active capabilities: 2.5MB vs 128KB.
// Both fit in L3 cache on any modern system. If the working set grows
// beyond this, LRU eviction of cold capabilities mitigates pressure.
25.3 Impact on Distributed Capabilities
Distributed capabilities (Section 47.9) carry a signature for network verification. With PQC:
Current DistributedCapability:
Base fields: ~64 bytes
Ed25519 signature: 64 bytes
Total: ~128 bytes per capability
With ML-DSA-44:
Base fields: ~64 bytes
ML-DSA-44 signature: 2,420 bytes
Total: ~2,484 bytes per capability
Impact:
- Capability token is ~20x larger.
- RDMA bandwidth for capability exchange: negligible (capabilities
are exchanged once and cached, not per-operation).
- Verification time: Ed25519 verify is ~33-50 μs; ML-DSA-44 verify is
~50-120 μs (see the Benchmark Reference in Section 25.5). Results are
cached after first verification. ML-DSA-44 is somewhat slower in
absolute terms on modern x86 (optimized AVX2 implementations narrow
the gap via efficient matrix-vector operations). Both are
Ed25519 due to efficient matrix-vector operations). Both are
negligible on cold paths.
- Memory for cached capabilities: 2.5KB vs 128 bytes per entry.
For a cluster with 1,000 active capabilities: 2.5 MB vs 128 KB.
Negligible on any modern system.
25.4 Hybrid Mode (Transition Period)
During the transition period (2025-2035), sign with BOTH classical and PQC algorithms:
Kernel image signature:
Ed25519 signature: 64 bytes (verifiable by old bootloaders)
ML-DSA-65 signature: 3,309 bytes (verifiable by PQC-aware bootloaders)
Verification responsibility:
- Pre-ISLE bootloaders (UEFI Secure Boot, GRUB) do NOT understand
ISLE's KernelSignature format. Backward compatibility requires
that the kernel image also carry a standard signature in the
format the bootloader expects (e.g., Authenticode PE signature
for UEFI Secure Boot, GPG detached signature for GRUB). The
ISLE KernelSignature (containing hybrid Ed25519 + ML-DSA-65)
is appended separately and is invisible to these bootloaders.
- ISLE-aware bootloaders that predate PQC support verify only
the classical component of the hybrid signature.
**Hybrid signature wire format**: Hybrid signatures use a
length-prefixed layout, not a fixed-offset convention. Format:
`[classical_len: u16 | classical_sig: [u8; classical_len] |
pqc_sig: [u8; remaining]]`. The verifier reads `classical_len`
to split the signature buffer. For Ed25519 + ML-DSA-65 hybrids,
`classical_len = 64`. For RSA-4096-PSS + ML-DSA-65 hybrids,
`classical_len = 512`. This format accommodates any classical
algorithm size without breaking the wire protocol.
An older ISLE-aware verifier that does not recognize algorithm
ID 0x0200 reads `classical_len` from the first two bytes of the
signature buffer, extracts `classical_sig[0..classical_len]`,
and verifies that component only, ignoring the trailing PQC
signature data. To enable this, older ISLE-aware verifiers MUST
treat any algorithm ID in the range 0x0200-0x02FF as "hybrid:
extract classical_len from bytes [0..2), verify classical_sig
from bytes [2..2+classical_len)". Legacy verifiers that only
understand algorithm IDs in the range 0x0000-0x00FF ignore the
hybrid field entirely. Algorithm IDs outside recognized ranges
cause verification to fail (unknown algorithm = untrusted).
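The length-prefixed layout and the 0x0200-0x02FF range rule above can be sketched directly. This is a minimal illustration, not ISLE's implementation; in particular, the byte order of `classical_len` is an assumption here (the text does not state it — little-endian is assumed):

```rust
/// Split a hybrid signature buffer laid out as
/// [classical_len: u16 LE | classical_sig | pqc_sig]
/// into its (classical, pqc) components.
/// Returns None if the buffer is too short or classical_len overruns it.
pub fn split_hybrid(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let len_bytes: [u8; 2] = buf.get(0..2)?.try_into().ok()?;
    let classical_len = u16::from_le_bytes(len_bytes) as usize;
    let classical = buf.get(2..2 + classical_len)?;
    let pqc = &buf[2 + classical_len..];
    Some((classical, pqc))
}

/// Algorithm IDs 0x0200-0x02FF are hybrid; an older ISLE-aware verifier
/// uses this range check to know it must extract the classical component.
pub fn is_hybrid(algorithm_id: u32) -> bool {
    (0x0200..=0x02FF).contains(&algorithm_id)
}
```

An older verifier would call `split_hybrid`, verify only the classical half, and ignore the trailing PQC bytes; a PQC-aware verifier verifies both halves.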
- The ISLE boot stub, once loaded, verifies BOTH the Ed25519 and
ML-DSA signatures before proceeding. If either fails, boot halts.
Algorithm unavailability:
- If hardware acceleration for one algorithm is unavailable, software
fallback is used. Both signatures must always verify — there is no
single-algorithm fallback mode, as this would defeat the purpose of
hybrid signatures. Systems that cannot verify both algorithms must
refuse to boot.
Driver signature (.kabi_sig section):
Ed25519: 64 bytes
ML-DSA-44: 2,420 bytes
Total overhead per driver binary: ~2.5 KB. Negligible.
Cluster capability signature:
Ed25519PlusMlDsa44: verify both. Total verify time: ~85-170 μs (sum of both verifications; see Benchmark Reference, Section 25.5).
Cached after first verification. Zero hot-path impact.
25.5 Linux Compatibility
The Linux crypto API (/proc/crypto, AF_ALG sockets) is the standard interface
for userspace access to kernel cryptography. PQC algorithms are registered as
additional entries:
$ cat /proc/crypto | grep -A3 ml-dsa
name : ml-dsa-44
driver : ml-dsa-44-generic
module : kernel
type : akcipher
Existing Linux tools (cryptsetup, dm-crypt, etc.) use the crypto API.
Adding PQC algorithms is additive — existing algorithms remain available.
PQC Signature Size and IPC Ring Buffers:
PQC signatures (2.4KB–17KB) are much larger than Ed25519 (64 bytes). However, IPC
ring buffers (Section 8) do not carry signatures — capabilities are verified at
connect() time, not per-message. The SignatureData::Heap variant handles large
PQC signatures without affecting ring buffer sizing. No ring buffer resize is needed.
Side-Channel Resistance:
All PQC implementations in the kernel MUST be constant-time. Variable-time operations
leak secret key material through timing and cache side channels. CI enforcement:
ctgrind-style testing (Valgrind memcheck with secret memory marked as undefined)
runs on every commit touching crypto code. Non-constant-time code paths are rejected.
Benchmark Reference:
Approximate times on a modern x86 core (single-threaded, AVX2-enabled).
References: ed25519-dalek 4.x, pqcrypto-dilithium (AVX2), liboqs 0.10+:
- Ed25519: ~33-50 μs (verify), ~15-20 μs (sign)
- ML-DSA-44: ~50-120 μs (verify), ~80-200 μs (sign)
- ML-DSA-65: ~100-200 μs (verify), ~250-460 μs (sign)
- ML-KEM-768: ~90 μs (encapsulate), ~80 μs (decapsulate)
Ranges reflect implementation quality (optimized AVX2 vs. reference C) and CPU microarchitecture. ML-DSA-44 verification is typically about 2x slower than Ed25519 in absolute terms, though both are fast enough to be negligible on cold-path operations (boot, driver load, cluster join). The performance difference is irrelevant for operations that occur at most a few times per second.
25.6 Performance Impact
Zero hot-path impact. Cryptographic operations occur only on cold paths:
- Boot: once (kernel signature verification)
- Driver load: once per driver (~50 drivers at boot)
- Cluster join: once per node
- Capability creation: infrequent (not per-operation)
Ed25519 verification (~33-50 μs) and ML-DSA-44 verification (~50-120 μs) are both fast; ML-DSA-65 (~100-200 μs) is slower still. All are fast enough for cold-path operations. ML-KEM encapsulation is faster than ECDH. The performance difference is negligible for operations that occur once per boot, driver load, or cluster join.
25.7 PQC Key Management
PQC private keys must be stored securely. TPM 2.0 PQC support is emerging: TCG TPM 2.0 Library Specification 2.0 (V184, March 2025) is the latest published revision. V185 RC4 (December 2025) adds PQC algorithm profiles and completed public review on February 10, 2026. Until V185 is formally published, it should be treated as near-final draft, not as a normative reference. Initial PQC-capable TPM hardware samples were announced in late 2025 (e.g., SEALSQ). Production availability expected 2026-2027.
Key storage options (in priority order):
1. TPM 2.0 with PQC profile (near-term):
TPM stores ML-DSA/ML-KEM private keys in hardware. The key never
leaves the TPM. This is the ideal solution.
Status: as noted above, V184 is published and V185 (which adds the
PQC algorithm profiles) is near-final after RC4. PQC-capable TPM
hardware samples announced (SEALSQ, late 2025); production
availability expected 2026-2027.
2. HSM / external key store:
Enterprise deployments use network HSMs (Thales Luna, AWS CloudHSM)
that already support ML-DSA/ML-KEM. Kernel communicates via PKCS#11
or vendor-specific interface. Key never in kernel memory.
3. UEFI Secure Variable storage:
Private key stored in a custom UEFI authenticated variable
(NOT db/dbx — those store public keys and certificates for
signature verification). Protected by Secure Boot chain: only
authenticated updates can modify the variable. Key loaded into
kernel memory at boot, used for signing, then zeroed. Vulnerable
to cold-boot attacks. Acceptable for non-TEE systems.
4. Software key in kernel memory (fallback):
IMPORTANT: Only the VERIFICATION (public) key is embedded in
kernel .rodata. This is standard practice (Linux embeds its
module signing public key the same way). The public key allows
the kernel to verify signatures but cannot create them. Exposure
of the public key via a memory read vulnerability does NOT
compromise the signing process — an attacker who reads the
public key can verify signatures (which is public information)
but cannot forge them. The PRIVATE (signing) key must NEVER
be present in the running kernel image. It exists only in the
build environment (offline signing server, HSM, or developer
workstation) and is used during the build process to produce
signatures that are then appended to kernel/driver images.
For cluster identity keys (which require a private key at
runtime for mutual authentication), the private key is loaded
from disk at boot and held in kernel memory. Protected by:
- Kernel ASLR (address space randomization)
- Confidential computing (if running in TEE, key is hardware-encrypted)
- Memory zeroing on key rotation
Weakest option for runtime private keys. Acceptable for
development and non-critical deployments.
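The priority order above is a straightforward fallback chain. A hypothetical sketch (the `KeyStore`/`select_key_store` names are illustrative, not ISLE APIs) makes the selection rule explicit:

```rust
// Sketch of the priority-ordered key-storage backend selection.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum KeyStore {
    TpmPqc,       // 1. TPM 2.0 with PQC profile
    Hsm,          // 2. Network HSM via PKCS#11
    UefiVariable, // 3. UEFI authenticated variable
    SoftwareKey,  // 4. Key in kernel memory (fallback)
}

pub struct Available {
    pub tpm_pqc: bool,
    pub hsm: bool,
    pub uefi_auth_var: bool,
}

/// Pick the most secure backend the platform offers. A software key
/// is always possible, so selection never fails — but deployments can
/// refuse to proceed if policy forbids the fallback.
pub fn select_key_store(avail: &Available) -> KeyStore {
    if avail.tpm_pqc {
        KeyStore::TpmPqc
    } else if avail.hsm {
        KeyStore::Hsm
    } else if avail.uefi_auth_var {
        KeyStore::UefiVariable
    } else {
        KeyStore::SoftwareKey
    }
}
```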
Root of trust for verified boot (Section 22):
Phase 1: Classical Ed25519 public key in UEFI db (existing infrastructure).
Phase 2: Hybrid Ed25519 + ML-DSA-65 public keys in UEFI db.
Phase 3: ML-DSA-65 key in TPM 2.0 PQC profile (when available).
Each phase is a strict superset — old keys continue to work.
26. Confidential Computing
26.1 Why This Cannot Wait
AMD SEV-SNP, Intel TDX, and ARM CCA are shipping today. Every major cloud provider offers confidential VMs. The fundamental shift: the kernel becomes untrusted by its own workloads.
Traditional trust model (current design):
Hardware → Kernel → Processes
The kernel can read all process memory.
The kernel is the root of trust.
Confidential computing trust model:
Hardware → TEE (Trusted Execution Environment) → Workload
The kernel CANNOT read TEE memory. Hardware enforces encryption.
The kernel is OUTSIDE the trust boundary.
The workload trusts only hardware + its own code.
This directly conflicts with several kernel subsystems if not designed in:
| Subsystem | Conflict | Resolution |
|---|---|---|
| Page migration (HMM, Section 43) | Kernel can't read encrypted pages to copy them | Hardware assists migration (SEV-SNP page copy with re-encryption) |
| Memory compression (Section 13) | Kernel can't compress what it can't read | Skip compression for confidential pages (metadata-only eviction) |
| DSM (Section 47.5) | Can't RDMA encrypted pages — remote node can't decrypt | TEE-to-TEE RDMA with shared key negotiation |
| In-kernel inference (Section 45) | Can't observe page content of encrypted memory; access patterns (page faults, accessed/dirty bits) remain fully observable (Section 26.7) | Train models on access patterns and scheduling metadata only; content-based features unavailable for TEE workloads (~10-20% quality reduction, Section 26.7) |
| Crash recovery (Section 9) | Can't inspect process state to reconstruct after driver crash | Preserve TEE context across driver reload (hardware feature) |
| FMA telemetry (Section 39) | Health events may leak information about confidential workloads | Aggregate-only telemetry for confidential contexts |
26.2 Architectural Requirements
// isle-core/src/security/confidential.rs
/// Confidential computing context.
/// Attached to a process or VM that runs inside a TEE.
pub struct ConfidentialContext {
/// Hardware TEE type.
pub tee_type: TeeType,
/// Hardware-managed encryption key ID.
/// Each TEE context has a unique key. The kernel never sees the key.
/// The hardware memory controller encrypts/decrypts transparently.
pub key_id: u32,
/// VMPL level (0-3). Only meaningful when tee_type == AmdSevSnpVmpl.
/// 0 = paravisor/SVSM (most privileged), 1+ = guest OS.
/// For non-VMPL TEE types, this field is 0 and ignored.
pub vmpl_level: u8,
/// Attestation state.
pub attestation: AttestationState,
/// Memory policy: how the kernel handles this context's pages.
pub memory_policy: ConfidentialMemoryPolicy,
}
#[repr(u32)]
pub enum TeeType {
/// No TEE. Standard process. Kernel can access all memory.
None = 0,
/// AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging).
/// Implicitly runs at VMPL0 (most privileged level within the VM).
AmdSevSnp = 1,
/// Intel TDX (Trust Domain Extensions).
IntelTdx = 2,
/// ARM CCA (Confidential Compute Architecture).
ArmCca = 3,
/// AMD SEV-SNP with VMPL (Virtual Machine Privilege Level) at a
/// non-default level. VMPL allows multiple trust levels within a
/// single SEV-SNP VM: VMPL0 = paravisor/SVSM (most privileged),
/// VMPL1+ = guest OS. Required for AMD's SVSM architecture.
/// The VMPL level is stored in ConfidentialContext.vmpl_level (u8).
AmdSevSnpVmpl = 4,
}
// Note: The VMPL level for AmdSevSnpVmpl is stored in ConfidentialContext
// as a separate field, not embedded in the enum variant. This preserves
// #[repr(u32)] compatibility (Rust requires all variants of a repr(inttype)
// enum to be unit variants with explicit discriminants).
#[repr(u32)]
pub enum ConfidentialMemoryPolicy {
/// All pages are encrypted. Kernel cannot read content.
/// Kernel can still manage page tables, migration metadata.
FullEncryption = 0,
/// Shared pages (explicitly marked by guest) are readable by kernel.
/// Private pages are encrypted. Used for I/O buffers.
HybridShared = 1,
}
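The policy comments above reduce to a small decision rule: whether the kernel may read a page's content depends on the context's policy and on whether the guest explicitly marked the page shared. A minimal sketch of that rule (the `page_kernel_readable` helper is illustrative, not an ISLE API):

```rust
// Sketch: kernel readability of a page under a confidential memory policy.
#[derive(Clone, Copy, PartialEq)]
#[repr(u32)]
pub enum ConfidentialMemoryPolicy {
    FullEncryption = 0,
    HybridShared = 1,
}

/// `policy` is None for a standard (non-TEE) process.
/// `guest_marked_shared` is true only for pages the guest explicitly
/// shared (e.g., I/O buffers).
pub fn page_kernel_readable(
    policy: Option<ConfidentialMemoryPolicy>,
    guest_marked_shared: bool,
) -> bool {
    match policy {
        None => true, // standard process: kernel reads everything
        Some(ConfidentialMemoryPolicy::FullEncryption) => false,
        Some(ConfidentialMemoryPolicy::HybridShared) => guest_marked_shared,
    }
}
```

This is the value that would populate `PageHandle.kernel_readable` (Section 26.3).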
/// Attestation lifecycle state.
pub enum AttestationState {
/// No attestation attempted yet.
Unattested,
/// Challenge issued, awaiting hardware report.
PendingChallenge { nonce: [u8; 32] },
/// Successfully attested. Report can be verified by remote party.
Attested {
/// Hardware-generated attestation report (opaque, TEE-specific).
/// Attestation report sizes vary by platform:
/// AMD SEV-SNP: ~1,184 bytes (ATTESTATION_REPORT structure),
/// Intel TDX: 1,024 bytes (TDREPORT),
/// ARM CCA: Realm token (Realm Measurement and Attestation Token, RMAT)
/// as defined in the ARM CCA Security Model specification.
/// The 4,096-byte buffer provides headroom for all platforms
/// including any future format extensions.
/// Only report_len bytes are meaningful.
report: [u8; 4096],
/// Number of valid bytes in the report buffer.
report_len: u16,
/// SHA-384 hash of the report (for quick identity checks).
report_hash: [u8; 48],
/// When attestation completed (monotonic clock).
timestamp: u64,
},
/// Attestation failed (hardware error or measurement mismatch).
Failed { reason: AttestationError },
}
#[repr(u32)]
pub enum AttestationError {
/// Hardware does not support attestation.
HardwareUnsupported = 0,
/// Measurement does not match expected value.
MeasurementMismatch = 1,
/// Hardware reported an internal error.
HardwareError = 2,
/// Certificate chain validation failed.
CertificateInvalid = 3,
}
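The attestation lifecycle above can be sketched as a small state machine. This is a deliberately simplified illustration: real reports embed the challenge nonce and a certificate chain, and real verification checks measurements. Here only the state transitions and the 4,096-byte buffer bound are modeled:

```rust
// Simplified sketch of the AttestationState lifecycle.
pub enum AttestationState {
    Unattested,
    PendingChallenge { nonce: [u8; 32] },
    Attested { report_len: u16 },
    Failed,
}

impl AttestationState {
    /// Issue a fresh challenge nonce (callers supply randomness).
    pub fn issue_challenge(&mut self, nonce: [u8; 32]) {
        *self = AttestationState::PendingChallenge { nonce };
    }

    /// Accept a hardware report. Fails if no challenge is outstanding
    /// or the report exceeds the fixed 4,096-byte buffer.
    pub fn submit_report(&mut self, report: &[u8]) -> Result<(), ()> {
        match self {
            AttestationState::PendingChallenge { .. } if report.len() <= 4096 => {
                *self = AttestationState::Attested {
                    report_len: report.len() as u16,
                };
                Ok(())
            }
            _ => {
                *self = AttestationState::Failed;
                Err(())
            }
        }
    }
}
```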
Attestation Verification:
Hardware-rooted attestation allows remote parties to verify the integrity of a confidential workload. AMD SEV-SNP uses a Versioned Chip Endorsement Key (VCEK) signed by AMD's Key Distribution Service (KDS). Intel TDX uses a Quoting Enclave (QE) that produces quotes verifiable via Intel's Provisioning Certification Service (PCS). ARM CCA uses platform attestation tokens signed by the CCA platform key.
ISLE exposes Linux-compatible attestation device nodes:
- /dev/sev-guest — AMD SEV-SNP guest attestation reports
- /dev/tdx-guest — Intel TDX guest attestation reports (via TDX_CMD_GET_REPORT)
These are implemented in isle-compat with identical ioctl semantics to Linux.
26.3 Design Approach: Opaque Page Handles
The key design principle: kernel subsystems must be able to manage pages without reading their contents. Every subsystem that touches page data needs a "confidential-aware" path.
// isle-core/src/mem/page.rs
/// A physical page handle. The kernel always has this.
/// Whether the kernel can read the page DATA depends on the page's
/// confidentiality state.
pub struct PageHandle {
/// Physical frame number.
pub pfn: u64,
/// Ownership and state metadata (kernel can always read this).
pub state: PageState,
/// Is this page's content readable by the kernel?
/// false for pages owned by a ConfidentialContext.
pub kernel_readable: bool,
/// For confidential pages: hardware key ID for re-encryption
/// during migration. The kernel uses this to instruct hardware
/// to re-encrypt the page for the destination context.
pub encryption_key_id: Option<u32>,
}
Subsystem-by-subsystem adaptation:
Memory compression (Section 13):
if page.kernel_readable:
compress normally (LZ4/zstd)
else:
skip compression — evict to swap encrypted
(hardware encrypts at rest, no kernel involvement)
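The compression decision above can be sketched in Rust. The compressor itself is stubbed (a real path would call LZ4/zstd); the point is the confidential-aware dispatch on `kernel_readable`:

```rust
// Sketch: eviction path that skips compression for confidential pages.
pub struct PageHandle {
    pub pfn: u64,
    pub kernel_readable: bool,
}

#[derive(PartialEq, Debug)]
pub enum CompressOutcome {
    Compressed(Vec<u8>),
    /// Evicted as-is: hardware keeps the page encrypted at rest,
    /// with no kernel involvement.
    EvictedEncrypted,
}

pub fn evict_page(page: &PageHandle, data: &[u8]) -> CompressOutcome {
    if page.kernel_readable {
        // Stand-in for LZ4/zstd; returns the bytes unmodified here.
        CompressOutcome::Compressed(data.to_vec())
    } else {
        // Kernel cannot read the content; skip compression entirely.
        CompressOutcome::EvictedEncrypted
    }
}
```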
Page migration (Section 43, GPU VRAM):
if page.kernel_readable:
DMA copy (standard path)
else:
hardware-assisted encrypted copy
(SEV-SNP: SNP_PAGE_MOVE firmware command via PSP, TDX: TDH.MEM.PAGE.RELOCATE SEAMCALL)
DSM (Section 47.5):
if both endpoints are in the same TEE trust domain:
Key negotiation protocol:
1. Both nodes produce attestation reports (hardware-rooted).
2. Reports are mutually verified (each node checks the other's
measurement, firmware version, security policy).
3. Key exchange via ML-KEM-768 (PQC, §25) authenticated by
attestation-bound identity.
4. Shared symmetric key established for RDMA encryption.
5. RDMA transfers encrypted with shared key (AES-GCM).
Latency: ~5ms for initial key negotiation (once per node pair).
Steady-state RDMA: same performance as non-TEE (hardware AES).
Key rotation: every 1 hour or 2^32 RDMA operations (whichever
comes first). Nonce construction: 4-byte connection ID || 8-byte
per-connection counter (deterministic, never reused within a key
lifetime). The 2^32 limit is a conservative policy: counter-based
nonces can safely reach 2^64, but frequent rotation limits the
exposure window of any single key.
Re-attestation: every 24 hours or on firmware update. If
re-attestation fails (measurement changed), the shared key is
revoked and all RDMA connections to that node are torn down.
DSM pages cached from that node are invalidated (Section 47.5).
else (different trust domains):
pages must be re-encrypted for each endpoint
→ higher latency, but functionally correct
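The nonce construction in the key-rotation policy above (4-byte connection ID || 8-byte per-connection counter) yields exactly the 96-bit nonce AES-GCM expects. A minimal sketch — byte order is an assumption here (big-endian), since the text does not specify it:

```rust
/// Build a 96-bit AES-GCM nonce as connection_id || counter.
/// Deterministic and never reused within a key lifetime, provided the
/// key is rotated before the counter wraps (policy: rotate at 2^32).
pub fn rdma_gcm_nonce(connection_id: u32, counter: u64) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    nonce[..4].copy_from_slice(&connection_id.to_be_bytes());
    nonce[4..].copy_from_slice(&counter.to_be_bytes());
    nonce
}
```

Distinct connection IDs guarantee distinct nonce streams even if two connections share a key epoch, which is what makes the per-connection counter safe.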
Hardware memory tagging (MTE, §19):
MTE tags are stored in a separate physical memory region (tag RAM).
For TEE-encrypted pages, tag RAM may also be encrypted.
The kernel cannot set/clear tags on confidential pages.
Policy: confidential pages are allocated UNTAGGED (tag = 0).
MTE checking is disabled for pages owned by a ConfidentialContext.
Rationale: MTE protects against kernel bugs accessing the page.
For confidential pages, hardware encryption already prevents
kernel access — MTE is redundant.
In-kernel inference (Section 45):
Can observe: page fault frequency, access pattern (from page table
accessed bits, not page content), memory pressure signals.
Cannot observe: page content.
→ Page prefetching model trains on access patterns, not content.
→ Works fine. Content is irrelevant for prefetch decisions.
26.4 Guest Mode: ISLE as a Confidential Guest
When ISLE runs inside a confidential VM (as a guest on a hypervisor):
ISLE as TDX/SEV-SNP guest:
1. All guest physical memory is encrypted by hardware.
2. The hypervisor cannot read guest memory.
3. ISLE must:
a. Use SWIOTLB (bounce buffer) for device I/O with the hypervisor.
b. Explicitly mark shared pages for virtio/MMIO communication.
c. Validate all data received from the hypervisor (untrusted host).
d. Support remote attestation so users can verify the guest.
This requires:
- Memory manager: distinguish "private" (encrypted) vs "shared" (plaintext) pages.
- I/O path: bounce buffers for virtio when running as confidential guest.
- Attestation: kernel provides attestation report via sysfs/ioctl.
26.5 Host Mode: ISLE Hosting Confidential VMs
When ISLE is the host running isle-kvm with confidential guests:
ISLE as TDX/SEV-SNP host:
1. isle-kvm (Tier 1 driver, Section 6) creates confidential VM contexts.
2. Kernel allocates encrypted physical pages for guest.
3. Kernel CANNOT read guest memory (hardware enforces).
4. Kernel CAN:
- Schedule vCPUs on physical CPUs (scheduling, Section 14)
- Manage guest memory mappings (page tables, metadata only)
- Migrate guest pages between NUMA nodes (hardware re-encryption)
- Apply cgroup limits (memory, CPU, accelerator)
5. Kernel CANNOT:
- Read guest page contents
- Inject code into guest
- Modify guest register state (except via controlled VM entry/exit)
26.6 Linux Compatibility
Existing Linux confidential computing interfaces:
/dev/sev (AMD SEV-SNP):
KVM_SEV_INIT, KVM_SEV_LAUNCH_START, KVM_SEV_LAUNCH_UPDATE,
KVM_SEV_LAUNCH_MEASURE, KVM_SEV_LAUNCH_FINISH, etc.
→ Implemented in isle-kvm, same ioctl numbers.
/dev/tdx_guest (Intel TDX):
TDX_CMD_GET_REPORT, TDX_CMD_EXTEND_RTMR, etc.
→ Implemented in isle-compat, same ioctl numbers.
KVM ioctls (generic):
KVM_CREATE_VM, KVM_SET_MEMORY_ATTRIBUTES (private vs shared), etc.
→ Implemented in isle-kvm.
QEMU, libvirt, cloud-hypervisor: all use these ioctls.
Binary compatibility preserved.
26.7 TEE Observability Degradation Model
When workloads run inside TEEs, some kernel optimization signals are unavailable. This is an inherent hardware constraint, not a software limitation. The kernel must degrade gracefully, not silently fail.
Signal                        Non-TEE   TEE (host observing guest)   Notes
────────────────────────────  ────────  ───────────────────────────  ──────────────
Page fault patterns           ✓ Full    ✓ Full                       Kernel manages page tables
Page table accessed/dirty     ✓ Full    ✓ Full                       Hardware sets bits, kernel reads
I/O request patterns          ✓ Full    ✓ Full                       I/O goes through kernel
CPU scheduling (PELT)         ✓ Full    ✓ Full                       Kernel schedules vCPUs
Memory allocation patterns    ✓ Full    ✓ Full                       Kernel allocates pages
Memory pressure signals       ✓ Full    ✓ Full                       Kernel tracks pressure
Page content (for compress)   ✓ Full    ✗ Unavailable                Hardware encryption
Hardware perf counters        ✓ Full    ◐ Restricted                 TDX: host reads limited set;
                                                                     SEV-SNP: guest-only by default;
                                                                     CCA: realm-restricted
In-kernel inference           ✓ Full    ◐ Reduced                    Trains on access patterns (ok);
                                                                     cannot use content features
Memory compression            ✓ Full    ✗ Disabled                   Can't compress encrypted pages
MTE tagging                   ✓ Full    ✗ Disabled                   Tags on confidential pages = 0
Degraded features for TEE workloads:
- Memory compression: disabled (pages evicted encrypted, no compression gain)
- Learned prefetch: works on access patterns (page faults, accessed bits)
but quality may be ~10-20% lower than non-TEE due to missing perf counters
- Intent I/O scheduling: works (sees I/O requests), same quality as non-TEE
- Power budgeting: works (RAPL is host-side, unaffected by TEE)
- MTE safety: disabled for TEE pages (hardware encryption is the safety mechanism)
This degradation is identical to Linux's: Linux has the same restricted visibility into TEE guests. No kernel can observe hardware-encrypted memory contents.
26.8 Performance Impact
When confidential computing is not used: zero overhead. No code runs,
no checks happen. The page.kernel_readable flag is always true. All code paths
take the standard branch.
When confidential computing is used: same overhead as Linux. The cost is hardware memory encryption (~1-5% memory bandwidth reduction), which is identical regardless of kernel implementation.
26.9 Confidential VM Live Migration
Confidential VMs require special migration handling because the hypervisor cannot read encrypted guest memory.
AMD SEV-SNP Migration:
The Platform Security Processor (PSP) manages migration. The source PSP exports encrypted guest pages with a transport key negotiated between source and destination PSPs. The destination allocates a new ASID and imports pages, re-encrypting with the destination's memory encryption key. Guest state (VMSA) is included in the encrypted export. The guest observes a brief pause but no data loss.
Intel TDX Migration:
TDX uses TD-Preserving migration via SEAMCALL instructions (TDH.EXPORT.MEM,
TDH.IMPORT.MEM, TDH.EXPORT.STATE.IMMUTABLE, TDH.IMPORT.STATE.IMMUTABLE,
TDH.EXPORT.STATE.VP, TDH.IMPORT.STATE.VP). The source TDX module exports TD
pages (via TDH.EXPORT.MEM), immutable TD-scope metadata (via TDH.EXPORT.STATE.IMMUTABLE), and
per-vCPU state (via TDH.EXPORT.STATE.VP) in an encrypted migration stream. The
destination TDX module imports and re-keys via the corresponding IMPORT SEAMCALLs.
Migration is transparent to the TD guest.
ARM CCA Migration:
CCA realm migration is not yet specified by ARM (as of 2026). The architecture reserves a stub for realm migration; implementation is deferred until the ARM specification is published.
ASID/Key Exhaustion:
isle-kvm tracks hardware encryption key IDs (ASIDs for SEV-SNP, HKIDs for TDX).
When all IDs are allocated, KVM_CREATE_VM with confidential flags returns
-ENOSPC. The administrator must terminate existing confidential VMs to free IDs.
Typical limits: SEV-SNP ASIDs are platform-dependent (e.g., 509 on Milan, 1006+ on Genoa),
discovered at runtime via CPUID Fn8000001F[ECX]; TDX supports ~64 HKIDs (hardware-dependent).
Note: The Genoa ASID count (1006) is higher than Milan's (509) because Genoa's PSP (Platform Security Processor) has an expanded key cache. Both values are correct for their respective processor generations, per AMD's SEV-SNP API specification; CPUID Fn8000001F[ECX] reports the actual count at runtime.
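The key-ID bookkeeping above amounts to a fixed pool of ASIDs/HKIDs with allocate/free, where exhaustion maps to -ENOSPC at the ioctl boundary. An illustrative sketch (the `KeyIdPool` name is not an ISLE API; the pool size would come from CPUID or TDX enumeration at runtime):

```rust
// Sketch: fixed pool of hardware encryption key IDs (ASIDs/HKIDs).
pub struct KeyIdPool {
    in_use: Vec<bool>, // index = key ID; sized per platform at init
}

impl KeyIdPool {
    pub fn new(count: usize) -> Self {
        KeyIdPool { in_use: vec![false; count] }
    }

    /// Allocate the lowest free key ID, or None if all are taken
    /// (KVM_CREATE_VM with confidential flags then fails with -ENOSPC).
    pub fn alloc(&mut self) -> Option<usize> {
        let id = self.in_use.iter().position(|used| !used)?;
        self.in_use[id] = true;
        Some(id)
    }

    /// Release a key ID when its confidential VM is terminated.
    pub fn free(&mut self, id: usize) {
        self.in_use[id] = false;
    }
}
```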
GHCB Protocol (SEV-ES/SNP):
SEV-ES and SEV-SNP guests cannot execute VMEXIT normally (host cannot read guest
registers). The Guest-Hypervisor Communication Block (GHCB) is a shared page where
the guest places exit information for the hypervisor. isle-kvm implements the GHCB
protocol (v2) for handling #VC (VMM Communication) exceptions.
TDX TDCALL:
TDX guests communicate with the TDX module via TDCALL instruction (not VMEXIT).
isle-kvm's TDX backend handles TDG.VP.VMCALL for guest-initiated exits.