
Sections 17–19 of the ISLE Architecture. For the full table of contents, see README.md.


Part IV: Boot and Hardware

Boot process, supported architectures, and hardware memory safety features.


17. Boot and Installation

17.1 Overview

ISLE uses a phased boot architecture. The current implementation boots via the Multiboot1 protocol through GRUB or QEMU's -kernel flag — sufficient for development, testing, and early hardware bring-up. The production target is UEFI stub boot with Linux boot protocol compatibility, enabling drop-in package installation alongside existing Linux kernels.

The boot code lives in isle-kernel/src/boot/ (assembly entry, Multiboot parser) and isle-kernel/src/arch/*/boot.rs (per-architecture boot routines). The initialization sequence is in isle-kernel/src/main.rs.

17.2 Current Implementation: Multiboot Boot

17.2.1 Boot Protocols

The kernel ELF contains dual Multiboot headers — both Multiboot1 and Multiboot2 are present in the binary, allowing either protocol at the bootloader's choice:

  • Multiboot1 (magic 0x1BADB002): Fully implemented. Used by QEMU (-kernel flag) and GRUB (multiboot command). Parser in boot/multiboot1.rs extracts the memory map, command line, and bootloader name.
  • Multiboot2 (magic 0xE85250D6): Header present in the ELF but no parser implemented. The magic is recognized in isle_main() but the info structure is not parsed. Planned for Phase 2.

The linker script (linker-x86_64.ld) places headers in dedicated sections: .multiboot1 (4-byte aligned, first 8 KB) and .multiboot2 (8-byte aligned, first 32 KB), ensuring bootloaders find them. The kernel loads at physical address 0x100000 (1 MB), the standard Multiboot load address.
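The Multiboot1 header is validated by its checksum field: the spec requires the magic, flags, and checksum words to sum to zero modulo 2^32. A minimal sketch of computing that field (the flag values are the spec's; which flags ISLE actually sets in its header is an assumption for illustration):

```rust
// Multiboot1 requires magic + flags + checksum == 0 (mod 2^32).
const MULTIBOOT1_MAGIC: u32 = 0x1BAD_B002;
const FLAG_ALIGN_MODULES: u32 = 1 << 0; // page-align loaded modules
const FLAG_MEMORY_INFO: u32 = 1 << 1;   // request mem_* / mmap_* fields

/// Checksum value that makes the three header words sum to zero.
fn multiboot1_checksum(flags: u32) -> u32 {
    0u32.wrapping_sub(MULTIBOOT1_MAGIC.wrapping_add(flags))
}
```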

Build and boot methods:

# Development: QEMU with -kernel (Multiboot1, no ISO needed)
qemu-system-x86_64 -kernel target/x86_64-unknown-none/release/isle-kernel -serial stdio

# Testing: GRUB ISO boot (Multiboot1 via grub.cfg `multiboot` command)
make iso && qemu-system-x86_64 -cdrom target/isle-kernel.iso -serial stdio

Non-x86 architectures use different boot protocols:

  • Device Tree Blob (DTB): Used by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. The firmware or QEMU passes a pointer to a flattened device tree (FDT) in a register at entry (x0 on AArch64, r2 on ARMv7, a1 on RISC-V, r3 on PPC32, r3 on PPC64LE). The DTB describes the machine's physical memory layout, interrupt controllers, timers, and peripheral addresses. The format is big-endian with magic 0xD00DFEED. See §17.2.7 for the parsing specification.
  • OpenSBI (RISC-V only): The Supervisor Binary Interface firmware runs in M-mode and provides SBI ecalls for timer, IPI, console, and system reset services to S-mode code. QEMU's built-in OpenSBI occupies physical addresses 0x80000000–0x801FFFFF. At entry, OpenSBI passes a0 = hart_id (hardware thread identifier) and a1 = DTB address. The kernel must not overwrite the OpenSBI region.
  • OpenFirmware / SLOF (PPC64LE): On POWER systems, SLOF (Slimline Open Firmware) or OPAL (OpenPOWER Abstraction Layer) firmware initializes hardware and passes a DTB pointer in r3. QEMU's pseries machine uses SLOF; bare metal POWER8/9/10 uses OPAL (skiboot). At entry: r3 = DTB address, r4 = 0 (reserved). The kernel runs in hypervisor or supervisor mode.
  • U-Boot / OpenFirmware (PPC32): Embedded PowerPC boards typically use U-Boot which passes a DTB pointer in r3. QEMU's ppce500 machine uses U-Boot or direct kernel boot. At entry: r3 = DTB address, r4 = kernel_start, r5 = 0 (reserved).

17.2.2 x86-64 Entry Sequence

The boot assembly (boot/entry.asm, NASM syntax) handles the transition from 32-bit protected mode to 64-bit long mode:

1. GRUB/QEMU loads ELF at 1 MB, jumps to _start in 32-bit protected mode
   - eax = Multiboot1 magic (0x2BADB002)
   - ebx = pointer to Multiboot info structure

2. _start (32-bit):
   a. Save magic: eax → esi (preserved across BSS clear and CPUID check)
   b. Set temporary stack at 0x80000 (below kernel)
   c. Clear BSS (rep stosd from __bss_start to __bss_end — clobbers edi, ecx, eax)
   d. Build identity-map page tables for first 1 GB:
      PML4[0] → boot_pdpt | PRESENT | WRITABLE
      PDPT[0] → boot_pd   | PRESENT | WRITABLE
      PD[0..511] → 512 × 2 MB pages (flags: PRESENT | WRITABLE | PAGE_SIZE)
   e. Save info ptr: ebx → ebp (preserve across CPUID — ebx is clobbered by CPUID)
   f. Verify long mode: CPUID leaf 0x80000001 bit 29
      (displays "NO64" on VGA buffer and halts if not available)
   g. Restore info ptr: ebp → ebx
   h. Enable PAE (CR4 bit 5)
   i. Enable Long Mode (IA32_EFER MSR bit 8)
   j. Enable Paging (CR0 bit 31)
   k. Load temporary 64-bit GDT (null + code + data descriptors)
   l. Far jump to _start64 (selector 0x08 = 64-bit code segment)

3. _start64 (64-bit):
   a. Load 64-bit data segments (selector 0x10)
   b. Set kernel stack (boot_stack_top, 16 KB in .bss)
   c. Clear RFLAGS
   d. Map to 64-bit calling convention: edi = esi (magic), esi = ebx (info ptr)
   e. Call isle_main(multiboot_magic=rdi, multiboot_info_ptr=rsi)

Page tables and boot stack are allocated in .bss (zeroed by step 2c): boot_pml4 (4 KB), boot_pdpt (4 KB), boot_pd (4 KB), boot stack (16 KB).
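Step 2d above can be sketched in Rust terms: each of the 512 page-directory entries identity-maps one 2 MB physical region with the three flags named. The bit positions are the architectural x86-64 values; the constant names mirror the text, not necessarily ISLE's own definitions:

```rust
// x86-64 page-table entry flag bits (architectural positions).
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const PAGE_SIZE: u64 = 1 << 7; // PS: this PD entry maps a 2 MB page
const TWO_MB: u64 = 2 * 1024 * 1024;

/// Build the 512 PD entries that identity-map the first 1 GB
/// (the loop the boot assembly performs in step 2d).
fn build_boot_pd() -> [u64; 512] {
    let mut pd = [0u64; 512];
    for (i, entry) in pd.iter_mut().enumerate() {
        *entry = (i as u64 * TWO_MB) | PRESENT | WRITABLE | PAGE_SIZE;
    }
    pd
}
```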

17.2.3 Kernel Initialization Phases (x86-64)

isle_main() detects the boot protocol from the magic value, then runs an ordered initialization sequence. Each phase depends on the previous:

Phase 1:  GDT + TSS
          Load a proper GDT with TSS. Configure IST1 with a dedicated
          16 KB stack for double-fault handling.

Phase 2:  IDT + PIC
          Install exception handlers (0-31) and IRQ handlers (32-47).
          Remap the 8259 PIC: IRQ0 → vector 32, IRQ8 → vector 40.

Phase 3:  Physical Memory Manager
          Parse Multiboot1 memory map (see §17.2.9).
          Initialize bitmap allocator: mark available regions free,
          reserve first 1 MB (BIOS/legacy) and kernel image.

Phase 4:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB total).
          Initialize free-list allocator → enables alloc::Vec, Box, String.
          This initial heap size is a bootstrap minimum; the allocator expands
          dynamically once memory discovery completes (Section 12).

Phase 5:  Virtual Memory
          Verify identity mapping (virt_to_phys on mapped addresses).
          Test new page mappings: allocate frame, map at 0x40000000,
          write/read volatile, unmap, free frame.

Phase 6:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 7:  IPC / MPK Detection
          Query CPUID for PKU support. Test domain alloc/free.

Phase 8:  Enable Interrupts
          Enable IRQs, verify timer ticks are incrementing.

Phase 9:  Scheduler
          Initialize round-robin scheduler. Spawn two test threads
          (thread-A, thread-B). Run cooperative yield loop, then
          enable preemptive scheduling via timer tick callback.

Phase 10: SYSCALL/SYSRET
          Configure STAR/LSTAR/SFMASK MSRs. Register three syscall
          handlers: write(1), getpid(39), exit_group(231).
          Test with inline SYSCALL instruction from kernel mode.
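A note on Phase 10's STAR MSR: SYSCALL loads its kernel CS from STAR[47:32], and SYSRET derives the user CS/SS from the base in STAR[63:48]. A sketch of the packing, where the selector values (0x08 kernel code, 0x10 as the SYSRET base) are illustrative assumptions rather than ISLE's confirmed GDT layout:

```rust
// STAR layout (architectural): SYSCALL CS/SS base in bits [47:32],
// SYSRET user-selector base in bits [63:48]. Low 32 bits are unused
// in 64-bit mode.
fn pack_star(kernel_cs: u16, sysret_base: u16) -> u64 {
    ((sysret_base as u64) << 48) | ((kernel_cs as u64) << 32)
}
```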

17.2.3.1 Secondary CPU Bringup (x86-64 SMP)

After Phase 10 completes on the BSP (Bootstrap Processor), secondary CPUs (Application Processors, APs) are brought online:

Phase 11: AP Detection
          Query ACPI MADT (Multiple APIC Description Table) or MP Table
          for CPU count and LAPIC IDs. Allocate PerCpu<T> slots for
          each detected CPU.

Phase 12: AP Trampoline Setup
          a. Allocate a 4 KB page below 1 MB (in low memory, identity-mapped)
             for the AP trampoline code. This is required because APs start
             in real mode (16-bit) with paging disabled.
          b. Copy trampoline code (16-bit → 32-bit → 64-bit transition) to
             the low-memory page. The trampoline:
             - Starts in 16-bit real mode at physical address 0xNN00
             - Enables protected mode (32-bit)
             - Loads a temporary GDT (same layout as BSP's)
             - Enables long mode (64-bit)
             - Loads CR3 with the kernel's page tables
             - Jumps to ap_entry() in high memory
          c. The trampoline communicates with the BSP via a "mailbox":
             - ap_startup_vector: LAPIC ID of AP to start
             - ap_cr3: Page tables to load
             - ap_stack: Per-CPU stack pointer
             - ap_entry: 64-bit entry point
             - ap_ready: Set to 1 by AP when in long mode
             - ap_cpu_id: CPU index (not LAPIC ID) assigned by BSP

Phase 13: AP Startup (per CPU, parallel)
          For each AP detected in Phase 11:
          a. BSP sets up the mailbox for this AP (stack, cr3, entry point)
          b. BSP sends INIT IPI to AP's LAPIC (assert level)
          c. BSP waits 10ms (Intel recommendation)
          d. BSP sends STARTUP IPI (SIPI) with trampoline vector
          e. BSP waits 200μs
          f. BSP sends second SIPI (required by old processors)
          g. BSP polls ap_ready until timeout (1 second) or AP signals ready
          h. If timeout: log error, mark CPU as failed, continue
          i. If ready: AP is now in ap_entry() in 64-bit mode

Phase 14: AP Initialization (per AP, in ap_entry())
          Each AP runs this sequence independently after signaling ready:
          a. Load proper GDT and TSS (per-CPU TSS required for IST stacks)
          b. Load IDT (same as BSP)
          c. Enable interrupts
          d. Initialize per-CPU scheduler runqueue
          e. Calibrate LAPIC timer (delay calibration loop)
          f. Signal "online" to BSP
          g. Enter scheduler idle loop (hlt + monitoring for work)

Phase 15: SMP Online
          BSP waits for all APs to signal online (or timeout).
          System is now fully multi-CPU. Scheduler load-balances
          across all online CPUs.

Per-CPU data initialization: Each AP needs its own per-CPU data structures initialized:

  - PerCpu<T> slots for scheduler runqueue, current task pointer, etc.
  - GDT with per-CPU TSS (the TSS must be unique per CPU for IST stacks)
  - LAPIC timer calibration (varies per CPU due to manufacturing differences)
  - IRQ affinity: by default all IRQs target the BSP; distribute them to other CPUs via the IOAPIC redirection table or LAPIC logical destination mode

ACPI MADT parsing (x86-64):

MADT (Multiple APIC Description Table):
  - Located via RSDP → RSDT/XSDT → MADT signature "APIC"
  - Provides: Local APIC address, CPU LAPIC IDs, IOAPIC addresses
  - CPU entries: LAPIC ID, flags (enabled/disabled)
  - Override entries: IRQ source overrides, NMI sources

The BSP's LAPIC ID is read from LAPIC_ID register (MMIO at 0xFEE00020).
All other entries in MADT are APs.
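The MADT walk described above can be sketched as a byte-wise scan: each interrupt-controller entry starts with a (type, length) header, and type 0 (Processor Local APIC) carries the LAPIC ID plus a flags word whose bit 0 marks the CPU enabled. The entry layout follows the ACPI spec; the function shape and the assumption that the slice starts at the first entry (offset 44 in the real table) are illustrative:

```rust
/// Collect the LAPIC IDs of all enabled CPUs from the MADT's
/// variable-length entry list (type 0 = Processor Local APIC).
fn collect_lapic_ids(entries: &[u8]) -> Vec<u8> {
    let mut ids = Vec::new();
    let mut off = 0;
    while off + 2 <= entries.len() {
        let (etype, len) = (entries[off], entries[off + 1] as usize);
        if len < 2 || off + len > entries.len() {
            break; // malformed entry: stop rather than overrun
        }
        if etype == 0 && len >= 8 {
            let apic_id = entries[off + 3];        // LAPIC ID byte
            let enabled = entries[off + 4] & 1 == 1; // flags bit 0
            if enabled {
                ids.push(apic_id);
            }
        }
        off += len; // advance to the next (type, length) header
    }
    ids
}
```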

Failure handling: If an AP fails to start within the timeout:

  - Log the failure with LAPIC ID and CPU index
  - Mark the CPU slot as failed (not online)
  - Continue boot with reduced CPU count
  - Do NOT panic — single-CPU operation is valid

Hot-plug support (future): The ACPI namespace may indicate CPU hot-plug capability. The mailbox mechanism is reused for hot-plug: writing to the ACPI CPU hot-plug register triggers the same INIT/SIPI sequence.

17.2.4 AArch64 Boot Sequence

QEMU's -M virt -cpu cortex-a72 -kernel loads the ELF at 0x40080000 and enters at _start in EL1 (Exception Level 1) with the MMU off. Register x0 holds the DTB address provided by QEMU's built-in firmware.

Entry assembly (arch/aarch64/entry.S, GNU as syntax):

1. QEMU jumps to _start in EL1, MMU off
   - x0 = DTB address (passed by QEMU firmware)

2. _start:
   a. Save DTB pointer: mov x19, x0 (x19 is callee-saved)
   b. Disable all exceptions: msr daifset, #0xf
      (masks Debug, SError, IRQ, FIQ in DAIF register)
   c. Enable FPU/NEON: write CPACR_EL1.FPEN bits [21:20] = 0b11
      (without this, any NEON/FP instruction traps — Rust generates
      NEON instructions by default for aarch64). This clobbers x0,
      but the DTB pointer was saved to x19 in step (a).
   d. Load stack pointer: adrp x1, _stack_top / add / mov sp, x1
      (64 KB stack in .bss._stack, 16-byte aligned)
   e. Clear BSS: zero memory from __bss_start to __bss_end
      (str xzr loop, 8 bytes per iteration)
   f. Prepare arguments: x0 = 0 (no multiboot), x1 = x19 (DTB address)
   g. Branch: bl isle_main
   h. Halt loop: wfe (wait-for-event) if isle_main returns

Stack (64 KB) is allocated in .bss._stack (16-byte aligned). The linker script (linker-aarch64.ld) places .text._start first and provides __bss_start / __bss_end symbols for BSS clearing.

Initialization phases (in isle_main(), sequential):

Phase 1:  Exception Vectors (VBAR_EL1)
          Write vector table base to VBAR_EL1 (16 entries × 128 bytes,
          2 KB aligned). Vectors cover: Synchronous, IRQ, FIQ, SError
          at each of four exception origins (current EL SP0/SPx, lower
          EL AArch64/AArch32).

Phase 2:  BSS Verification
          Verify BSS is zeroed (entry.S clears BSS in assembly, same
          pattern as x86 entry.asm step 2c). Perform any additional
          initialization that depends on zeroed static data.

Phase 3:  DTB Parse
          Parse the DTB (received in x0 at entry, forwarded as the
          info pointer to isle_main; see §17.2.7). Extract /memory
          regions, /chosen bootargs, interrupt controller base (GIC),
          timer IRQ numbers, and UART base address.

Phase 4:  Physical Memory Manager
          Pass DTB memory regions to phys::init(). Mark available
          regions free, reserve kernel image (__bss_end and below).
          No legacy BIOS region to reserve (unlike x86).

Phase 5:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB). Initialize
          free-list allocator → enables alloc::Vec, Box, String.
          This initial heap size is a bootstrap minimum; the allocator expands
          dynamically once memory discovery completes (Section 12).

Phase 6:  Virtual Memory (TTBR0_EL1)
          Build identity-map page tables using 4 KB granule:
          - TCR_EL1: T0SZ=16 (48-bit VA), TG0=0b00 (4 KB granule),
            ORGN0/IRGN0 = write-back cacheable, SH0 = inner shareable
          - 4-level tables: L0 (PGD) → L1 (PUD) → L2 (PMD) → L3 (PTE)
          - Identity map all physical RAM
          - Set TTBR0_EL1, isb, enable MMU via SCTLR_EL1.M bit

Phase 7:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 8:  GIC Initialization (v2 or v3, detected at runtime)
          Read GIC version and base addresses from DTB
          (`compatible` = "arm,gic-400" for GICv2, "arm,gic-v3" for GICv3).
          - GICv2 path:
            GICD (Distributor): enable, configure IRQ priorities and
            targets for all SPIs. Set priority mask.
            GICC (CPU Interface): enable, set priority mask to 0xFF
            (accept all priorities), set BPR (binary point).
          - GICv3 path:
            GICD (Distributor): enable, configure affinity routing (ARE=1),
            set priorities for all SPIs.
            GICR (Redistributor): per-CPU, configure SGI/PPI group and
            priority. Enable redistributor.
            ICC system registers: ICC_PMR_EL1 = 0xFF (accept all),
            ICC_IGRPEN1_EL1 = 1 (enable group 1 interrupts).
          Route timer IRQ (PPI 27 = virtual timer) to this CPU.

Phase 9:  Generic Timer
          Configure the ARM generic timer (virtual counter):
          - Write timer period to CNTV_TVAL_EL0
          - Enable timer: CNTV_CTL_EL0 = ENABLE (bit 0), clear IMASK
          - Timer fires IRQ 27 (virtual timer PPI) → tick handler
          Enable interrupts: msr daifclr, #0xf

Phase 10: Scheduler
          Initialize round-robin scheduler. Spawn test threads.
          Run cooperative yield loop, then enable preemptive
          scheduling via timer tick callback.

Secondary CPU Bringup (AArch64 via PSCI):

After Phase 10 completes on the primary CPU, secondary CPUs are brought online using PSCI (Power State Coordination Interface):

Phase 11: Secondary CPU Detection
          Parse DTB /cpus node for all CPU entries:
          - Each cpu@N node contains: reg = MPIDR affinity bits
          - device_type = "cpu"
          - enable-method = "psci" (indicates PSCI is used)
          Count CPUs and allocate PerCpu<T> slots.

Phase 12: PSCI Method Detection
          Check /psci node in DTB for PSCI method:
          - method = "smc": invoke PSCI calls via the SMC instruction
          - method = "hvc": invoke PSCI calls via the HVC instruction
          (the function ID, e.g. 0xC4000003 for PSCI_CPU_ON, is passed in x0)
          Verify PSCI version via PSCI_VERSION (0x84000000):
          - Major version in bits 31:16, minor in 15:0
          - Require PSCI 1.0+ for full feature support

Phase 13: Secondary CPU Startup (per CPU, sequential)
          For each secondary CPU:
          a. Prepare per-CPU data:
             - Allocate stack (64 KB) in kernel heap
             - Set per-cpu slot: cpu_id, mpidr, stack_top
             - Prepare entry point: secondary_entry()
          b. Call PSCI_CPU_ON:
             - x0 = 0xC4000003 (function ID for CPU_ON AArch64)
             - x1 = target_cpu (MPIDR affinity bits from DTB)
             - x2 = entry_point (secondary_entry physical address)
             - x3 = context_id (cpu_id to pass to secondary)
             - SMC or HVC depending on PSCI method
          c. Check return value:
             - PSCI_RET_SUCCESS (0): CPU starting
             - PSCI_RET_ALREADY_ON (-4): CPU already running
             - PSCI_RET_INVALID_PARAMS (-2): Bad parameters
             - Other negative: failure
          d. Poll for CPU online (timeout 1 second):
             - Secondary sets per-cpu slot flag: online = true
          e. If timeout: log error, mark CPU as failed
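The return-value check in step (c) can be sketched as a small decoder. The numeric codes are from the PSCI specification; the enum and function names are illustrative:

```rust
/// Outcome of a PSCI_CPU_ON call, decoded from the firmware's
/// signed return value in x0.
#[derive(Debug, PartialEq)]
enum CpuOnResult {
    Starting,      // CPU is coming online
    AlreadyOn,     // CPU was already running
    InvalidParams, // bad MPIDR / entry point
    OtherError(i64),
}

fn decode_cpu_on(ret: i64) -> CpuOnResult {
    match ret {
        0 => CpuOnResult::Starting,       // PSCI_RET_SUCCESS
        -4 => CpuOnResult::AlreadyOn,     // PSCI_RET_ALREADY_ON
        -2 => CpuOnResult::InvalidParams, // PSCI_RET_INVALID_PARAMS
        e => CpuOnResult::OtherError(e),  // any other negative code
    }
}
```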

Phase 14: Secondary CPU Entry (secondary_entry, per AP)
          Each secondary CPU enters here in EL1 with MMU off:
          a. Enable FPU/NEON (CPACR_EL1.FPEN = 0b11)
          b. Load per-CPU data pointer via MPIDR affinity matching
          c. Load stack pointer from per-cpu slot
          d. Load kernel page tables (TTBR0_EL1)
          e. Enable MMU (SCTLR_EL1.M = 1), ISB
          f. Jump to high-memory entry point: secondary_init()
          g. In secondary_init():
             - Load VBAR_EL1 (exception vectors)
             - Initialize GIC CPU interface (GICv2 GICC or GICv3 ICC regs)
             - Calibrate timer
             - Enable interrupts
             - Set per-cpu online flag
             - Enter scheduler idle loop

Phase 15: SMP Online
          Primary CPU waits for all secondaries to signal online.
          System is fully multi-CPU. GIC affinity routing distributes
          interrupts across all CPUs.

MPIDR affinity (AArch64): Each CPU has a unique MPIDR_EL1 value:

  - Bits [7:0]: Affinity level 0 (core within cluster)
  - Bits [15:8]: Affinity level 1 (cluster within socket)
  - Bits [23:16]: Affinity level 2 (socket)
  - Bits [39:32]: Affinity level 3 (extended, rare)

The DTB /cpus/cpu@N/reg property contains these affinity bits. PSCI_CPU_ON uses the full MPIDR value to identify the target CPU.
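Unpacking those affinity fields is a matter of shifting and masking; the bit positions are architectural (Armv8-A MPIDR_EL1, with Aff3 at bits [39:32]), while the helper name is illustrative:

```rust
/// Split an MPIDR_EL1 value into its (Aff0, Aff1, Aff2, Aff3)
/// affinity fields, as used to match DTB cpu@N reg values.
fn mpidr_affinity(mpidr: u64) -> (u8, u8, u8, u8) {
    (
        (mpidr & 0xFF) as u8,          // Aff0: core within cluster
        ((mpidr >> 8) & 0xFF) as u8,   // Aff1: cluster within socket
        ((mpidr >> 16) & 0xFF) as u8,  // Aff2: socket
        ((mpidr >> 32) & 0xFF) as u8,  // Aff3: extended (bits [39:32])
    )
}
```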

17.2.5 ARMv7 Boot Sequence

QEMU's -M vexpress-a15 -kernel loads the ELF at 0x60010000 and enters at _start in SVC (Supervisor) mode with the MMU off. Registers: r0 = 0, r1 = machine type, r2 = DTB address.

Entry assembly (arch/armv7/entry.S, GNU as syntax):

1. QEMU jumps to _start in SVC mode, MMU off
   - r0 = 0 (unused), r1 = machine type, r2 = DTB address

2. _start:
   a. Disable IRQ and FIQ: cpsid if
      (sets I and F bits in CPSR)
   b. Set up IRQ mode stack: switch to IRQ mode (cps #0x12),
      load 4 KB IRQ stack, switch back to SVC mode (cps #0x13)
   c. Load SVC stack pointer: ldr sp, =_stack_top
      (64 KB stack in .bss._stack, 16-byte aligned via .align 4)
   d. Clear BSS: zero memory from __bss_start to __bss_end
      (str r6 loop, 4 bytes per iteration)
   e. Prepare 64-bit arguments (AAPCS: u64 passed as register pairs):
      - r0:r1 = 0:0 (multiboot_magic, both halves)
      - r2:r3 = dtb_addr:0 (multiboot_info, low:high)
   f. Branch: bl isle_main
   g. Halt loop: wfe if isle_main returns

Stack (64 KB) is in .bss._stack (16-byte aligned via .align 4, which on ARM GAS means 2^4 = 16 bytes). The linker script (linker-armv7.ld) places .text._start first at 0x60010000 (offset from the vexpress-a15 base 0x60000000 to leave room for the bootloader stub).

Initialization phases (in isle_main(), sequential):

Phase 1:  Exception Vectors (VBAR)
          Write vector table base to VBAR via CP15 c12 register:
          mcr p15, 0, <reg>, c12, c0, 0
          Vector table: 8 entries (Reset, Undef, SVC, Prefetch Abort,
          Data Abort, reserved, IRQ, FIQ) × 4-byte branch instructions.
          Each vector branches to a full handler stub.

Phase 2:  BSS Verification
          Verify BSS is zeroed (entry.S clears BSS in assembly, same
          pattern as x86 entry.asm step 2c). Perform any additional
          initialization that depends on zeroed static data.

Phase 3:  DTB Parse
          Parse the DTB passed in r2 (see §17.2.7). Extract /memory
          regions, /chosen bootargs, GIC base addresses, timer IRQ
          numbers, and UART base. vexpress-a15 has well-known addresses
          but DTB parsing keeps the code machine-independent.

Phase 4:  Physical Memory Manager
          Pass DTB memory regions to phys::init(). The vexpress-a15
          machine provides up to 1 GB RAM starting at 0x60000000 (or
          0x80000000 depending on configuration). Reserve kernel image.

Phase 5:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB). Initialize
          free-list allocator → enables alloc::Vec, Box, String.
          This initial heap size is a bootstrap minimum; the allocator expands
          dynamically once memory discovery completes (Section 12).

Phase 6:  Virtual Memory (TTBR0, Short Descriptor)
          Build identity-map using ARMv7 short-descriptor format:
          - TTBR0: points to L1 table (4096 × 32-bit entries, 16 KB)
          - L1 entries: section descriptors (1 MB pages) for identity map
            Flags: AP=0b11 (full access), TEX/C/B for normal cacheable
          - DACR: domain 0 = Client (0b01), all others = No Access
          - Enable MMU: set SCTLR.M bit via mcr p15, 0, <reg>, c1, c0, 0
          - 1 MB sections are sufficient for initial identity map;
            L2 tables (256 × 4 KB pages) added later for fine-grained mapping
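The section descriptors used in Phase 6 can be sketched as follows. Bit positions are from the ARMv7-A short-descriptor format (AP[1:0] at bits [11:10], C at bit 3, B at bit 2, type 0b10 = section); the helper is an illustrative sketch, not ISLE's actual code:

```rust
/// Encode an ARMv7 short-descriptor L1 "section" entry: a 1 MB
/// identity mapping with AP=0b11 (full access), C+B set (normal
/// cacheable), domain 0.
fn l1_section_entry(phys_mb_base: u32) -> u32 {
    assert_eq!(phys_mb_base & 0xF_FFFF, 0, "must be 1 MB aligned");
    phys_mb_base
        | (0b11 << 10) // AP[1:0] = full access
        | (1 << 3)     // C: cacheable
        | (1 << 2)     // B: bufferable
        | 0b10         // descriptor type bits [1:0]: section
    // domain field (bits [8:5]) left at 0 => domain 0, checked by DACR
}
```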

Phase 7:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 8:  GIC Initialization
          ARMv7 platforms typically use GICv2 (GICv3 supports ARMv7/AArch32
          but is rare on ARMv7 SoCs; limited to 3 affinity levels in AArch32).
          Read GICD/GICC bases from DTB (vexpress-a15 defaults:
          GICD = 0x2C001000, GICC = 0x2C002000).
          Configure distributor, CPU interface, route timer IRQ.

Phase 9:  Timer
          Configure SP804 dual timer or ARM generic timer (if available):
          - SP804 (vexpress): program LOAD register, enable with
            periodic mode + interrupt enable, IRQ via GIC SPI
          - Generic timer (Cortex-A15): CNTVCT, CNTV_TVAL, CNTV_CTL
            (same registers as AArch64, accessed via CP15 c14)
          Enable interrupts: cpsie if

Phase 10: Scheduler
          Initialize round-robin scheduler. Spawn test threads.
          Run cooperative yield loop, then enable preemptive
          scheduling via timer tick callback.

17.2.6 RISC-V 64 Boot Sequence

QEMU's -M virt -bios default -kernel runs OpenSBI in M-mode, which then jumps to the kernel at 0x80200000 in S-mode (Supervisor mode). Registers: a0 = hart_id, a1 = DTB address.

Entry assembly (arch/riscv64/entry.S, GNU as syntax):

1. OpenSBI jumps to _start in S-mode
   - a0 = hart_id (hardware thread ID, usually 0 on single-core)
   - a1 = DTB address

2. _start:
   a. Disable interrupts: csrci sstatus, 0x2
      (clears SIE bit in supervisor status register)
   b. Load stack pointer: la sp, _stack_top
      (64 KB stack in .bss._stack, 16-byte aligned)
   c. Clear BSS: zero memory from __bss_start to __bss_end
      (sd zero loop, 8 bytes per iteration)
   d. Arguments already in correct registers:
      a0 = hart_id (passed as multiboot_magic parameter)
      a1 = DTB address (passed as multiboot_info parameter)
   e. Call: call isle_main (jal with ra)
   f. Halt loop: wfi (wait-for-interrupt) if isle_main returns

Stack (64 KB) is in .bss._stack (16-byte aligned). The linker script (linker-riscv64.ld) places .text._start first at 0x80200000, after the OpenSBI firmware region (0x80000000–0x801FFFFF).

Initialization phases (in isle_main(), sequential):

Phase 1:  Exception Vectors (stvec)
          Write trap handler address to stvec CSR in Direct mode
          (stvec[1:0] = 0b00). All traps — exceptions, software
          interrupts, external interrupts — vector to a single entry
          point that reads scause to dispatch.

Phase 2:  BSS Verification
          Verify BSS is zeroed (entry.S clears BSS in assembly, same
          pattern as x86 entry.asm step 2c). Perform any additional
          initialization that depends on zeroed static data.

Phase 3:  DTB Parse
          Parse the DTB passed in a1 (see §17.2.7). Extract /memory
          regions, /chosen bootargs, PLIC base address, CLINT address
          (if present), and UART base. QEMU virt machine uses standard
          addresses but DTB parsing keeps the code machine-independent.

Phase 4:  Physical Memory Manager
          Pass DTB memory regions to phys::init(). Mark available
          regions free. Reserve:
          - OpenSBI firmware: 0x80000000–0x801FFFFF (2 MB)
          - Kernel image: 0x80200000 to __kernel_end
          Unlike x86, no legacy BIOS region to reserve.

Phase 5:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB). Initialize
          free-list allocator → enables alloc::Vec, Box, String.
          This initial heap size is a bootstrap minimum; the allocator expands
          dynamically once memory discovery completes (Section 12).

Phase 6:  Virtual Memory (satp, Sv48)
          Build identity-map using Sv48 (4-level, 48-bit VA):
          - 4 levels: L3 (root) → L2 → L1 → L0, each 512 × 8-byte PTEs
          - PTE format: [53:10] PPN, [7:0] flags (V, R, W, X, U, G, A, D)
          - Identity map all physical RAM with RWX + Valid + Global
          - Write root table PPN to satp: MODE=Sv48 (9), ASID=0, PPN
          - Execute sfence.vma to flush TLB after satp write
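The leaf-PTE encoding used for that identity map can be sketched directly: the physical page number lands in bits [53:10] and the permission flags in the low byte. Flag positions follow the RISC-V privileged spec; the function name is illustrative:

```rust
// Sv48 PTE flag bits (RISC-V privileged spec positions).
const PTE_V: u64 = 1 << 0; // valid
const PTE_R: u64 = 1 << 1; // readable
const PTE_W: u64 = 1 << 2; // writable
const PTE_X: u64 = 1 << 3; // executable
const PTE_G: u64 = 1 << 5; // global

/// Leaf PTE mapping a 4 KB-aligned physical address as
/// RWX + Valid + Global: PPN = phys >> 12, stored at bits [53:10].
fn sv48_leaf_pte(phys: u64) -> u64 {
    ((phys >> 12) << 10) | PTE_V | PTE_R | PTE_W | PTE_X | PTE_G
}
```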

Phase 7:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 8:  PLIC Initialization
          Read PLIC base address from DTB (QEMU virt default: 0x0C000000).
          - Set priority threshold to 0 (accept all priorities)
          - Enable relevant interrupt sources (UART, etc.)
          - Set priority for each source
          PLIC handles external interrupts only; timer and software
          interrupts go through separate CSRs (sie.STIE, sie.SSIE).

Phase 9:  SBI Timer
          Use SBI ecall to program the timer:
          - Read current time: csrr time (or rdtime pseudo-instruction)
          - Set next deadline: sbi_set_timer(time + interval)
            (SBI EID=0x54494D45 "TIME", FID=0)
          - Enable timer interrupt: set sie.STIE (bit 5)
          Timer fires supervisor timer interrupt (scause = 5) →
          clear by calling sbi_set_timer with next deadline.
          Enable interrupts: csrsi sstatus, 0x2

Phase 10: Scheduler
          Initialize round-robin scheduler. Spawn test threads.
          Run cooperative yield loop, then enable preemptive
          scheduling via timer tick callback.

17.2.7 Device Tree Blob Parsing

The Device Tree Blob (DTB) is the memory map and hardware description format shared by AArch64, ARMv7, RISC-V 64, PPC32, and PPC64LE. It serves the same role as the Multiboot1 info structure on x86 (§17.2.9), providing the kernel with memory layout and device addresses at boot.

DTB format (Flattened Device Tree / FDT):

Offset  Field           Size    Description
0x00    magic           u32     0xD00DFEED (big-endian)
0x04    totalsize       u32     Total blob size in bytes
0x08    off_dt_struct   u32     Offset to structure block
0x0C    off_dt_strings  u32     Offset to strings block
0x10    off_mem_rsvmap  u32     Offset to memory reservation map
0x14    version         u32     DTB version (17)
0x18    last_comp_ver   u32     Last compatible version (16)
0x1C    boot_cpuid_phys u32     Physical ID of boot CPU
0x20    size_dt_strings u32     Size of strings block
0x24    size_dt_struct  u32     Size of structure block

All multi-byte fields are big-endian. The structure block contains a flattened tree of nodes and properties encoded as tokens: FDT_BEGIN_NODE (0x01), FDT_END_NODE (0x02), FDT_PROP (0x03), FDT_NOP (0x04), FDT_END (0x09).

Minimal parser (isle-kernel/src/boot/dtb.rs):

The kernel implements a minimal, no-alloc DTB parser that walks the structure block once and extracts only what's needed for boot:

  1. Validate header: check magic (0xD00DFEED), version ≥ 16
  2. /memory nodes → collect reg property values as MemoryRegion array (base + size pairs), passed to phys::init()
  3. /chosen node → extract bootargs property (kernel command line)
  4. Interrupt controller → extract reg property from the node with interrupt-controller property (GIC base for ARM, PLIC base for RISC-V)
  5. Timer → extract IRQ numbers from /timer node interrupts property
  6. UART → extract reg property from /serial or stdout-path device

The parser operates on raw byte slices with explicit big-endian reads and requires no heap allocation. Memory regions are collected into a fixed 64-entry array (matching the Multiboot1 parser's limit), since the parser runs during early boot before the heap allocator is available; device tree nodes beyond this limit are parsed in a second pass after heap initialization.
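Step 1 of the parse — validating the header with explicit big-endian reads — can be sketched like this. Field offsets are those of the FDT header table above; the function shape is illustrative, not the literal dtb.rs code:

```rust
/// Read a big-endian u32 from a raw byte buffer at `off`.
fn be32(buf: &[u8], off: usize) -> u32 {
    u32::from_be_bytes([buf[off], buf[off + 1], buf[off + 2], buf[off + 3]])
}

/// Returns totalsize if the blob starts with a usable FDT header:
/// correct magic and version >= 16.
fn validate_fdt_header(blob: &[u8]) -> Option<u32> {
    if blob.len() < 0x28 {
        return None; // shorter than the 40-byte header
    }
    if be32(blob, 0x00) != 0xD00D_FEED {
        return None; // bad magic
    }
    if be32(blob, 0x14) < 16 {
        return None; // version too old
    }
    Some(be32(blob, 0x04)) // totalsize
}
```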

Shared code: The DTB parser in isle-kernel/src/boot/dtb.rs is used by all five non-x86 architectures. Each architecture's boot.rs calls dtb::parse(dtb_addr) and passes the resulting memory regions to phys::init().

17.2.8 Cross-Architecture Comparison

The following table summarizes which boot components are architecture-specific and which are shared across all six architectures:

| Component | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Exception vectors | IDT (256 entries) | VBAR_EL1 (16 vectors) | VBAR CP15 (8 vectors) | stvec (Direct mode) | IVPR+IVORn | LPCR vector table |
| Memory map source | Multiboot1 info | DTB /memory | DTB /memory | DTB /memory | DTB /memory | DTB /memory |
| Page table format | 4-level PML4 (4 KB) | 4-level 4 KB granule | Short-desc 2-level (1 MB sections) | Sv48 4-level | 2-level (4 KB pages) | Radix tree (POWER9+) or HPT |
| IRQ controller | 8259 PIC (I/O ports) | GIC v2/v3 (MMIO, detected at runtime) | GICv2 (MMIO) | PLIC (MMIO) | OpenPIC (MMIO) | XIVE (MMIO) |
| Timer | PIT (I/O port 0x40) | Generic timer (system regs) | SP804 or generic timer | SBI ecall | Decrementer (DEC SPR) | Decrementer (DEC SPR) |
| Boot assembly | NASM (32→64 transition) | GNU as (EL1 entry) | GNU as (SVC entry) | GNU as (S-mode entry) | GNU as (supervisor entry) | GNU as (supervisor entry) |
| BSS clearing | entry.asm (rep stosd) | entry.S (str xzr loop) | entry.S (str r6 loop) | entry.S (sd zero loop) | entry.S (stw loop) | entry.S (std loop) |
| Phys allocator | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap | shared bitmap |
| Heap allocator | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list | shared free-list |
| Capability system | shared | shared | shared | shared | shared | shared |
| Scheduler | shared | shared | shared | shared | shared | shared |

17.2.9 Multiboot1 Memory Map Parsing

boot/multiboot1.rs parses the Multiboot1 info structure (passed by GRUB/QEMU) to extract the physical memory map:

  1. Read info structure flags to determine which fields are present
  2. If FLAG_MEM is set: log basic memory sizes (lower/upper KB)
  3. If FLAG_CMDLINE is set: log the kernel command line string
  4. If FLAG_MMAP is set: iterate the memory map entries:
     - Each entry has base_addr (u64), length (u64), and type (u32)
     - Types: available (1), reserved (2), ACPI reclaimable (3), NVS (4), defective (5)
     - Reads use read_unaligned — Multiboot mmap entries may not be aligned
  5. Collect up to 64 MemoryRegion structs and pass them to phys::init()
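The unaligned-entry walk can be sketched as follows. This is a hedged sketch, not the actual boot/multiboot1.rs API: the field layout follows the Multiboot1 specification, but the struct and function names here are illustrative.

```rust
use std::ptr;

/// One Multiboot1 mmap entry as laid out in memory. The leading `size`
/// field counts the bytes that FOLLOW it, so entries are variable-length
/// and successive entries need not be naturally aligned.
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct MmapEntry {
    size: u32,
    base_addr: u64,
    length: u64,
    typ: u32,
}

const MEMORY_AVAILABLE: u32 = 1;

/// Walk a raw mmap buffer and collect (base, length) pairs of available RAM.
fn parse_mmap(buf: &[u8]) -> Vec<(u64, u64)> {
    let mut regions = Vec::new();
    let mut off = 0usize;
    while off + std::mem::size_of::<MmapEntry>() <= buf.len() {
        // read_unaligned is required: nothing guarantees 8-byte alignment here.
        let entry: MmapEntry =
            unsafe { ptr::read_unaligned(buf.as_ptr().add(off) as *const MmapEntry) };
        if (entry.size as usize) < 20 {
            break; // malformed entry; stop rather than loop forever
        }
        if entry.typ == MEMORY_AVAILABLE {
            regions.push((entry.base_addr, entry.length));
        }
        // Advance past the 4-byte size field plus `size` payload bytes.
        off += 4 + entry.size as usize;
    }
    regions
}
```

The `size`-relative stride is the reason the parser cannot simply index an array of fixed-size structs: a bootloader may emit larger entries, and the spec only guarantees that `size` describes each one.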

phys::init() processes the regions in three phases:

  • Phase 1: Mark all available regions as free (page-aligned)
  • Phase 2: Reserve the first 1 MB (BIOS, VGA, legacy)
  • Phase 3: Reserve the kernel image (1 MB to __kernel_end)
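The three phases can be sketched with a toy frame bitmap. This is a hedged sketch under stated assumptions: the real allocator packs bits rather than using one bool per frame, and the function names and signature here are illustrative, not the actual phys::init() API.

```rust
const FRAME_SIZE: u64 = 4096;

/// Toy frame bitmap: one bool per 4 KB frame (the real allocator packs bits).
struct FrameBitmap {
    free: Vec<bool>,
}

impl FrameBitmap {
    fn new(frames: usize) -> Self {
        // Everything starts reserved; only regions the boot info reports
        // as available are released in phase 1.
        Self { free: vec![false; frames] }
    }

    fn set_range(&mut self, base: u64, len: u64, free: bool) {
        let first = (base / FRAME_SIZE) as usize;
        let last = ((base + len) / FRAME_SIZE) as usize;
        for f in first..last.min(self.free.len()) {
            self.free[f] = free;
        }
    }

    fn free_frames(&self) -> usize {
        self.free.iter().filter(|&&f| f).count()
    }
}

/// Sketch of the three-phase region processing: mark available RAM free,
/// then carve out the low 1 MB and the kernel image (loaded at 1 MB).
fn phys_init(regions: &[(u64, u64)], kernel_end: u64, total_frames: usize) -> FrameBitmap {
    let mut bm = FrameBitmap::new(total_frames);
    // Phase 1: mark available regions free (page-aligned).
    for &(base, len) in regions {
        bm.set_range(base, len, true);
    }
    // Phase 2: reserve the first 1 MB (BIOS, VGA, legacy).
    bm.set_range(0, 0x10_0000, false);
    // Phase 3: reserve the kernel image (1 MB to kernel_end).
    bm.set_range(0x10_0000, kernel_end - 0x10_0000, false);
    bm
}
```

Ordering matters: reservations run after the free pass, so a memory map that reports the low 1 MB as "available" still ends up reserved.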

17.2.10 PPC32 Boot Sequence

PPC32 targets embedded PowerPC processors (e500, 440, etc.) using QEMU's ppce500 machine. The firmware (U-Boot or QEMU direct boot) passes a DTB pointer in r3.

Entry assembly (arch/ppc32/entry.S, GNU as syntax):

1. Firmware loads ELF and jumps to _start in supervisor mode
   - r3 = DTB address
   - r4 = kernel image start (optional)
   - r5 = 0 (reserved)
2. _start:
   a. Set up stack pointer (r1) from linker symbol
   b. Clear BSS (.sbss + .bss)
   c. Set up initial exception vectors (IVPR + IVORn)
   d. Call isle_main(0, r3)  [magic=0, info=DTB address]

The linker script (linker-ppc32.ld) places .text._start first at the kernel load address. PPC32 uses big-endian byte order by default.

Initialization phases (in isle_main(), sequential):

Phase 1:  Exception Vectors (IVPR + IVORn)
          Set IVPR to exception vector base address.
          Initialize IVOR0-IVOR15 for each exception type:
          - IVOR0 (Critical input), IVOR1 (Machine check)
          - IVOR2 (Data storage), IVOR3 (Instruction storage)
          - IVOR4 (External input), IVOR5 (Alignment)
          - IVOR6 (Program), IVOR8 (System call)
          - IVOR10 (Decrementer), IVOR11 (Fixed interval timer)
          - IVOR12 (Watchdog), IVOR13 (Data TLB)
          - IVOR14 (Instruction TLB), IVOR15 (Debug)

Phase 2:  BSS Verification
          Verify BSS is zeroed (entry.S clears BSS in assembly).

Phase 3:  DTB Parse
          Parse the DTB passed in r3 (see §17.2.7). Extract /memory
          regions, /chosen bootargs, OpenPIC base address, UART base.

Phase 4:  Physical Memory Manager
          Pass DTB memory regions to phys::init(). Reserve kernel image.

Phase 5:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB). Initialize
          free-list allocator → enables alloc::Vec, Box, String.

Phase 6:  Virtual Memory (2-level page tables)
          Build identity-map using PPC32 2-level page table format:
          - PGD (Page Directory): 1024 × 32-bit entries (4 KB)
          - PTE (Page Table): 1024 × 32-bit entries per PGD entry (4 KB each)
          - Use 4 KB pages with WIMG bits for cache policy
          - Enable MMU via MSR[IR] and MSR[DR] bits

Phase 7:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 8:  OpenPIC Initialization
          Read OpenPIC base address from DTB.
          - Configure interrupt vector base
          - Set priority for each interrupt source
          - Enable external interrupts via MSR[EE]

Phase 9:  Decrementer Timer
          Program the decrementer (DEC SPR) for periodic interrupts:
          - Load initial value into DEC
          - Enable the decrementer interrupt: set TCR[DIE] (Book E timer
            control), with interrupts enabled via MSR[EE]
          Timer fires decrementer exception → reload DEC in handler.

Phase 10: Scheduler
          Initialize round-robin scheduler. Spawn test threads.
          Run cooperative yield loop, then enable preemptive
          scheduling via timer tick callback.
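Phase 10's round-robin yield loop reduces to rotating a run queue; a minimal sketch follows (the thread representation and names are illustrative, not the actual ISLE scheduler types):

```rust
use std::collections::VecDeque;

/// Minimal round-robin run queue: yielding moves the current thread to
/// the back and runs whoever is now at the front.
struct Scheduler {
    run_queue: VecDeque<u32>, // thread ids
}

impl Scheduler {
    fn new(threads: &[u32]) -> Self {
        Self { run_queue: threads.iter().copied().collect() }
    }

    /// Cooperative yield: the current (front) thread goes to the back,
    /// and the new front thread is returned as the one to run next.
    /// Preemptive scheduling is the same operation, except the timer
    /// tick callback invokes it instead of the thread itself.
    fn yield_now(&mut self) -> Option<u32> {
        let current = self.run_queue.pop_front()?;
        self.run_queue.push_back(current);
        self.run_queue.front().copied()
    }
}
```

This is why "enable preemptive scheduling via timer tick callback" is a small step once the cooperative loop works: the queue rotation is identical, only the caller changes.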

17.2.11 PPC64LE Boot Sequence

PPC64LE targets IBM POWER processors (POWER8, POWER9, POWER10) in little-endian mode. QEMU uses the pseries machine type with SLOF firmware, which passes a DTB pointer in r3. Bare metal systems use OPAL (skiboot) firmware.

Entry assembly (arch/ppc64le/entry.S, GNU as syntax):

1. SLOF/OPAL loads ELF and jumps to _start in hypervisor or supervisor mode
   - r3 = DTB address
   - r4 = 0 (reserved)
   - MSR: 64-bit mode (SF=1), little-endian (LE=1)
2. _start:
   a. Set up TOC pointer (r2) from .TOC. symbol
   b. Set up stack pointer (r1) from linker symbol
   c. Clear BSS
   d. Set up initial exception vectors
   e. Call isle_main(0, r3)  [magic=0, info=DTB address]

The linker script (linker-ppc64le.ld) places .text._start first at the kernel load address. PPC64LE uses the ELFv2 ABI with little-endian byte order.

Initialization phases (in isle_main(), sequential):

Phase 1:  Exception Vectors (LPCR + HSPRG0/1)
          Set HSPRG0 to per-CPU data pointer.
          Configure LPCR for exception vector base.
          Initialize system reset and machine check handlers.

Phase 2:  BSS Verification
          Verify BSS is zeroed (entry.S clears BSS in assembly).

Phase 3:  DTB Parse
          Parse the DTB passed in r3 (see §17.2.7). Extract /memory
          regions, /chosen bootargs, XIVE base addresses, UART base.

Phase 4:  Physical Memory Manager
          Pass DTB memory regions to phys::init(). Reserve kernel image.

Phase 5:  Kernel Heap
          Allocate 256 contiguous 4 KB frames (1 MB). Initialize
          free-list allocator → enables alloc::Vec, Box, String.

Phase 6:  Virtual Memory (Radix MMU on POWER9+, HPT on POWER8)
          Detect MMU type from DTB or CPU features:
          - POWER9+: Use Radix MMU (3-level page tables, 4 KB/64 KB/2 MB pages)
            Configure LPCR[HR] = 0 for Radix mode.
            Set up process table (PRTB) and page table root (PGD).
          - POWER8: Use HPT (Hash Page Table, 16 MB base page)
            Configure LPCR[HR] = 1 for HPT mode.
            Set up HPT base and size in SDR1.
          Enable MMU via MSR[IR] and MSR[DR] bits.

Phase 7:  Capability System
          Create CapSpace, test create/check/attenuate operations.

Phase 8:  XIVE Interrupt Controller
          Read XIVE base addresses from DTB.
          - Initialize Interrupt Controller (IC) registers
          - Initialize Thread Interrupt Management (TIMA)
          - Configure interrupt priorities and routing
          - Enable external interrupts via MSR[EE]

Phase 9:  Decrementer Timer
          Program the decrementer (DEC SPR) for periodic interrupts:
          - Load initial value into DEC (32-bit, wraps at 0)
          - Enable the decrementer exception via MSR[EE] (the decrementer
            is gated by EE, like external interrupts)
          Timer fires decrementer exception → reload DEC in handler.
          Note: hypervisor-capable POWER CPUs also provide HDEC (Hypervisor
          Decrementer) for guest timekeeping.

Phase 10: Scheduler
          Initialize round-robin scheduler. Spawn test threads.
          Run cooperative yield loop, then enable preemptive
          scheduling via timer tick callback.

17.3 Production Boot Target

The following subsections describe the target boot architecture for production deployments. None of this is implemented yet — it represents the design goal that the Multiboot implementation will evolve toward (see §17.7 for the migration path).

17.3.1 Goal: Drop-in Kernel Package

ISLE installs as a standard kernel package alongside the existing Linux kernel. The user can dual-boot between them using the GRUB menu.

# Debian / Ubuntu
apt install isle
update-initramfs -c -k isle-1.0.0
update-grub

# RHEL / Fedora
dnf install isle
dracut --force /boot/initramfs-isle-1.0.0.img isle-1.0.0
grub2-mkconfig -o /boot/grub2/grub.cfg

# Arch Linux
pacman -S isle
mkinitcpio -p isle

# Reboot, select "ISLE 1.0.0" from GRUB menu
# Existing Linux kernel is always available as a fallback entry

17.3.2 Boot Requirements

  • Image format: ELF kernel image with an embedded PE/COFF stub header, compatible with GRUB2 (loading as ELF), systemd-boot, and UEFI direct boot (loading as PE/COFF). Installed as /boot/vmlinuz-isle-VERSION (the "vmlinuz" name is a convention; the actual format is a PE/COFF-stubbed ELF, similar to Linux's bzImage with EFISTUB).
  • Boot protocol: x86 Linux boot protocol (for BIOS legacy boot) and UEFI stub (for UEFI direct boot). Both are supported.
  • Initramfs: Custom initramfs containing ISLE-native drivers for early boot (storage controller, root filesystem). Built using standard tools (dracut, mkinitcpio) with ISLE-specific hooks.
  • /boot layout: Fully compatible with existing distribution tools.
  • /boot/vmlinuz-isle-VERSION
  • /boot/initramfs-isle-VERSION.img
  • /boot/System.map-isle-VERSION (optional, for debugging)
  • Kernel command line: Standard Linux cmdline parameters are parsed and honored (root=, console=, quiet, init=, rw/ro, etc.).

17.3.3 Target Boot Sequences

x86-64 (production):

1. UEFI firmware (PE/COFF stub) / BIOS bootloader loads kernel image
2. Boot stub (Rust/asm) sets up:
   - Identity-mapped page tables
   - GDT, IDT stubs
   - Stack
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse boot parameters and ACPI tables
   b. Initialize physical memory allocator (from e820/UEFI memory map)
   c. Initialize virtual memory (kernel page tables, PCID)
   d. Initialize per-CPU data structures
   e. Initialize Tier 0 drivers: APIC, timer, early console
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init (typically systemd)

AArch64 (production):

1. UEFI firmware or QEMU -kernel loads the ELF, jumps to _start in EL1
2. Boot stub (assembly) sets up:
   - Exception vectors (VBAR_EL1)
   - Stack pointer
   - MMU disabled (identity-mapped initially)
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse device tree blob (DTB) passed in x0
   b. Initialize physical memory allocator (from DTB /memory nodes)
   c. Initialize virtual memory (TTBR0_EL1/TTBR1_EL1, ASID, TCR_EL1)
   d. Initialize per-CPU data structures (MPIDR_EL1 affinity)
   e. Initialize Tier 0 drivers: GIC (distributor + redistributor), generic timer, early console
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init

No microcode loading is performed — ARM firmware updates are handled by the platform firmware (UEFI capsule updates or vendor-specific mechanisms), not the kernel. This is architecturally correct: ARM's trust model places firmware updates in the Secure World (EL3/EL2), not in the Normal World OS.

ARMv7 (production):

1. QEMU vexpress-a15 loads the ELF, jumps to _start in SVC mode
2. Boot stub (assembly) sets up:
   - Vector table (VBAR)
   - Stack pointer
   - Interrupts disabled (CPSR I+F bits)
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse device tree blob (DTB) passed in r2
   b. Initialize physical memory allocator (from DTB /memory nodes)
   c. Initialize virtual memory (TTBR0, DACR for domain isolation)
   d. Initialize per-CPU data structures
   e. Initialize Tier 0 drivers: GIC, SP804 timer, early UART console
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init

ARMv7 does not have microcode updates. CPU errata on ARMv7 are addressed through kernel code paths (alternative instruction sequences) selected at boot based on the MIDR (Main ID Register) value.

RISC-V 64 (production):

1. OpenSBI (M-mode firmware) initializes hardware, jumps to _start in S-mode
   a0 = hart_id, a1 = DTB address
2. Boot stub (assembly) sets up:
   - Trap vector (stvec)
   - Stack pointer
   - Interrupts disabled (sstatus.SIE = 0)
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse device tree blob (DTB) from a1
   b. Initialize physical memory allocator (from DTB /memory nodes)
   c. Initialize virtual memory (satp CSR, Sv48 mode, ASID)
   d. Initialize per-CPU data structures (per-hart)
   e. Initialize Tier 0 drivers: PLIC, timer (via SBI ecall), early 16550 UART
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init

RISC-V does not have microcode updates. CPU errata are handled by OpenSBI (M-mode) or by kernel alternative code paths selected based on the mvendorid/marchid/mimpid CSRs (exposed via SBI or DTB).

PPC32 (production):

1. U-Boot or QEMU loads ELF, jumps to _start in supervisor mode
   r3 = DTB address
2. Boot stub (assembly) sets up:
   - Stack pointer (r1)
   - Exception vectors (IVPR base + IVOR offsets)
   - Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse device tree blob (DTB) from r3
   b. Initialize physical memory allocator (from DTB /memory nodes)
   c. Initialize virtual memory (TLB1 entries for initial mapping, then software page table)
   d. Initialize per-CPU data structures
   e. Initialize Tier 0 drivers: OpenPIC, decrementer timer, early UART console
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init

PPC32 does not have microcode updates. CPU errata are handled by kernel code paths selected at boot based on the PVR (Processor Version Register).

PPC64LE (production):

1. SLOF/OPAL firmware loads ELF, jumps to _start
   r3 = DTB address, MSR: SF=1, LE=1
2. Boot stub (assembly) sets up:
   - TOC pointer (r2) for position-independent data access
   - Stack pointer (r1)
   - Exception vectors (via LPCR and HSPRG0/1)
   - Interrupts disabled (MSR EE=0)
3. Jump to Rust entry point (isle_core::main)
4. ISLE Core initialization:
   a. Parse device tree blob (DTB) from r3
   b. Initialize physical memory allocator (from DTB /memory nodes)
   c. Initialize virtual memory (Radix MMU on POWER9+, HPT fallback on POWER8)
   d. Initialize per-CPU data structures (PIR = Processor Identification Register)
   e. Initialize Tier 0 drivers: XIVE interrupt controller, decrementer timer, early UART console
   f. Initialize capability system
   g. Initialize scheduler
   h. Mount initramfs (tmpfs)
5. Load Tier 1 storage driver from initramfs
6. Mount real root filesystem
7. Execute /sbin/init

PPC64LE does not have user-loadable microcode. POWER processor firmware updates are applied by the service processor (FSP or BMC) out-of-band, not by the OS kernel.

17.4 CPU Errata and Microcode

Modern CPUs ship with known errata — hardware bugs documented in vendor errata sheets. ISLE handles these systematically rather than scattering workarounds through the codebase.

Early microcode loading — CPU microcode is applied before most kernel initialization, matching the Linux early microcode loading model. The microcode blob is located by scanning the raw initramfs image in physical memory (NOT by mounting the filesystem — initramfs mount happens later at step 4h). Linux uses the same approach: the bootloader provides an uncompressed CPIO archive prepended to the initramfs; the kernel extracts the microcode by parsing the raw CPIO headers in memory at boot.

The microcode update runs between steps 4b (physical memory allocator init) and 4c (virtual memory init):

Boot step (between 4b and 4c): Early microcode update
  1. Scan raw initramfs blob in physical memory for microcode CPIO archive
     (/lib/firmware/intel-ucode/ or /lib/firmware/amd-ucode/ paths in CPIO)
  2. Validate signature (vendor-signed, no user-modifiable microcode)
  3. Apply via WRMSR to IA32_BIOS_UPDT_TRIG (Intel) or MSR_AMD64_PATCH_LOADER (AMD)
  4. Re-read CPUID — microcode may change feature flags (critical: must happen
     before step 4c which uses CPUID to configure page table features)
  5. Log applied microcode revision to ring buffer

Errata database — After microcode loading and CPUID enumeration, ISLE consults a per-CPU-model quirk table:

/// CPU errata entry — matches a specific CPU stepping to its required workarounds.
struct CpuErrata {
    /// CPU identification (vendor, family, model, stepping range).
    match_id: CpuMatch,
    /// Human-readable errata identifier (e.g., "SKX003", "ZEN4-ERR-1234").
    errata_id: &'static str,
    /// Workaround function applied during boot.
    workaround: fn() -> Result<()>,
    /// Category for boot-parameter override.
    category: ErrataCat,
}

enum ErrataCat {
    /// MSR write to disable/enable a feature.
    MsrTweak,
    /// Alternative code path (e.g., retpoline instead of indirect branch).
    CodePath,
    /// Disable a CPU feature entirely.
    FeatureDisable,
}

The quirk table is checked during boot (step 4d, after CPUID). Each matching entry's workaround function is called. Workarounds are logged to the ring buffer.
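Applying the table is a linear scan over matching entries. The sketch below is hedged: CpuMatch's fields, the error type, and the table contents are illustrative, not the actual ISLE definitions.

```rust
/// CPU identity as read from CPUID after the early microcode update.
#[derive(Clone, Copy, PartialEq)]
struct CpuMatch {
    vendor: &'static str,
    family: u32,
    model: u32,
}

/// Mirrors the CpuErrata struct above, trimmed to what the scan needs.
struct CpuErrata {
    match_id: CpuMatch,
    errata_id: &'static str,
    workaround: fn() -> Result<(), ()>,
}

/// Illustrative workaround: real entries poke MSRs or select
/// alternative code paths.
fn example_msr_tweak() -> Result<(), ()> {
    Ok(())
}

/// Illustrative table contents — real entries come from vendor errata sheets.
static ERRATA: &[CpuErrata] = &[CpuErrata {
    match_id: CpuMatch { vendor: "GenuineIntel", family: 6, model: 85 },
    errata_id: "SKX-EXAMPLE",
    workaround: example_msr_tweak,
}];

/// Linear scan: apply every workaround whose match_id covers this CPU,
/// returning the ids that were applied (the real kernel logs these).
fn apply_errata(cpu: CpuMatch) -> Vec<&'static str> {
    let mut applied = Vec::new();
    for e in ERRATA {
        if e.match_id == cpu && (e.workaround)().is_ok() {
            applied.push(e.errata_id);
        }
    }
    applied
}
```

A static table of function pointers keeps the workarounds out of hot paths: the scan runs once at boot, and only the matched entries cost anything at runtime.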

Spectre/Meltdown class mitigations:

| Vulnerability | Mitigation | ISLE scope |
|---|---|---|
| Meltdown (v3) | KPTI (page table isolation) | Required for Tier 2 + userspace; NOT needed for Tier 1 (same ring, MPK isolation) |
| Spectre v1 | LFENCE barriers at bounds checks; Speculative Load Hardening (SLH) | Compiler-inserted SLH (-mllvm -x86-speculative-load-hardening); manual LFENCE in asm hot paths |
| Spectre v2 | Retpoline / IBRS / eIBRS | Retpoline (-C target-feature=+retpoline-indirect-branches) for indirect branches in kernel code; eIBRS preferred on supporting hardware |
| Spectre v4 (SSB) | SSBD (Speculative Store Bypass Disable) | Per-thread via IA32_SPEC_CTRL MSR; toggled on context switch for untrusted threads |
| MDS/TAA | Buffer clears (VERW) | On context switch to userspace; on VM entry/exit |
| SRBDS | Microcode + VERW | Handled by early microcode update |
| RFDS/GDS | Microcode + opt-in VERW | Same as MDS path |

Mitigation boot parameters:

isle.mitigate=auto          # Default: apply mitigations based on detected CPU (recommended)
isle.mitigate=on            # Force all mitigations on, even if CPU claims to be fixed
isle.mitigate=off           # Disable all mitigations (INSECURE — benchmarking only)
isle.mitigate.kpti=off      # Disable specific mitigation class
isle.mitigate.retpoline=off # Disable specific mitigation class
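Parsing these parameters is a simple key/value scan over the kernel command line; the sketch below is illustrative (the MitigatePolicy type and return shape are assumptions, not the actual ISLE API):

```rust
#[derive(Debug, PartialEq)]
enum MitigatePolicy {
    Auto,
    On,
    Off,
}

/// Parse `isle.mitigate=` plus per-class `isle.mitigate.<class>=off`
/// overrides from a kernel command line. Unknown tokens are ignored,
/// matching the usual cmdline convention.
fn parse_mitigations(cmdline: &str) -> (MitigatePolicy, Vec<&str>) {
    let mut policy = MitigatePolicy::Auto; // default: auto-detect
    let mut disabled = Vec::new();
    for tok in cmdline.split_whitespace() {
        if let Some(v) = tok.strip_prefix("isle.mitigate=") {
            policy = match v {
                "on" => MitigatePolicy::On,
                "off" => MitigatePolicy::Off,
                _ => MitigatePolicy::Auto,
            };
        } else if let Some(rest) = tok.strip_prefix("isle.mitigate.") {
            // e.g. "kpti=off" disables just the KPTI mitigation class.
            if let Some((class, "off")) = rest.split_once('=') {
                disabled.push(class);
            }
        }
    }
    (policy, disabled)
}
```

The per-class overrides are applied after the global policy, so `isle.mitigate=on isle.mitigate.kpti=off` forces everything on except KPTI.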

Runtime reporting — Vulnerability status is exposed via Linux-compatible sysfs:

/sys/devices/system/cpu/vulnerabilities/meltdown:       "Mitigation: PTI"
/sys/devices/system/cpu/vulnerabilities/spectre_v1:     "Mitigation: usercopy/LFENCE"
/sys/devices/system/cpu/vulnerabilities/spectre_v2:     "Mitigation: eIBRS"
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass: "Mitigation: SSBD"
/sys/devices/system/cpu/vulnerabilities/mds:            "Mitigation: Clear buffers"

This ensures monitoring tools (spectre-meltdown-checker, lynis) work without modification.

17.5 Speculation Mitigations (All Architectures)

The x86-specific mitigation table in Section 17.4 covers only one architecture. Here is the complete per-architecture mitigation matrix:

AArch64 mitigations:

| Vulnerability | ARM identifier | Mitigation | ISLE scope |
|---|---|---|---|
| Spectre v1 (bounds bypass) | — | CSDB barriers at bounds checks | Compiler-inserted CSDB barriers after conditional branches (ARM equivalent of x86 SLH; uses the CSDB instruction, not LLVM's x86-specific -x86-speculative-load-hardening pass) |
| Spectre v2 (branch target injection) | CVE-2017-5715 | BTI (Branch Target Identification) | Hardware BTI (ARMv8.5+), enabled via SCTLR_EL1.BT1. Software: SMCCC ARCH_WORKAROUND_1 firmware call |
| Spectre-BHB | CVE-2022-23960 | BHB clearing sequence or firmware call | SMCCC ARCH_WORKAROUND_3 or BHB clearing loop on context switch |
| Meltdown (v3) | CVE-2017-5754 | KPTI (separate EL0/EL1 page tables) | Full KPTI on Cortex-A75 (all revisions) and Cortex-A520 before r0p2 (r0p2+ is not affected), per ARM security bulletins; not needed on cores with CSV3=Yes (Cortex-A76/A78, Cortex-X1/X2, Cortex-A710/A715, other Armv9/v8.x cores with CSV3) or earlier in-order cores (A53, A55, etc.). Cortex-A510 is a special case; see note below |
| Spectre v4 (SSB) | CVE-2018-3639 | SSBS (Speculative Store Bypass Safe) | Hardware SSBS bit (ARMv8.5+): per-thread via PSTATE.SSBS. Software: SMCCC ARCH_WORKAROUND_2 |
| Straight-line speculation | — | SB instruction after branches | Compiler-inserted speculation barriers |

Cortex-A510 note: ARM classifies the A510 issue as Variant 3 (all revisions), but the actual erratum (3117295) describes a speculative unprivileged load whose workaround is a TLBI instruction before returning to EL0, not full KPTI page table splitting. ISLE therefore applies the lightweight TLBI mitigation on A510, not the heavyweight page table split used for Cortex-A75.

ARM firmware interface: Unlike x86 (which uses MSR writes), ARM mitigations are often applied through SMCCC (SMC Calling Convention) firmware calls to EL3 Secure Monitor code. The kernel calls ARCH_WORKAROUND_1/2/3 — the firmware applies the actual mitigation. This is architecturally cleaner (firmware knows the exact CPU revision) but adds ~100-200 cycles per SMCCC call.

ARMv7 mitigations:

| Vulnerability | Mitigation | ISLE scope |
|---|---|---|
| Spectre v1 | CSDB barriers at bounds checks | Same as AArch64 |
| Spectre v2 | Firmware workaround via SMCCC | ARCH_WORKAROUND_1 for affected Cortex-A cores |
| Meltdown | Not applicable | ARMv7 Cortex-A cores are not affected |
| Spectre v4 | Firmware workaround | ARCH_WORKAROUND_2 where supported |

RISC-V mitigations:

| Vulnerability | Mitigation | ISLE scope |
|---|---|---|
| Spectre v1 | FENCE instructions at bounds checks | Manual insertion in assembly; compiler support evolving |
| Spectre v2 | Vendor-specific | SiFive: FENCE.I after indirect branches. Other vendors: per-implementation |
| Meltdown | Not applicable | In-order RISC-V cores not affected; OoO cores (e.g., SiFive P670) may need KPTI |
| Spectre v4 | Vendor-specific | No standard RISC-V mitigation; per-vendor microarchitecture |

RISC-V status: Speculation mitigations on RISC-V are less mature than on x86 or ARM. The RISC-V CFI extensions Zicfiss (shadow stacks) and Zicfilp (landing pads) are ratified as standalone ISA extensions; they are not part of the base privileged specification. ISLE implements both when the hardware reports support via the Zicfiss and Zicfilp ISA string entries, and applies vendor-specific workarounds based on mvendorid/marchid from the device tree, similar to the x86 errata database approach.

PowerPC mitigations:

| Vulnerability | Mitigation | ISLE scope |
|---|---|---|
| Spectre v1 | ori 31,31,0 (speculation barrier) | Inserted at bounds checks in assembly |
| Spectre v2 | Count cache flush + link stack flush | POWER8/9: bcctr flush sequence; POWER10: hardware mitigation |
| Meltdown | RFI flush (L1D cache flush) | POWER7+: flush on return from interrupt via rfid/hrfid |
| Spectre v4 | STF (store forwarding) barrier | ori 31,31,0 barrier; POWER9+ firmware toggle |

PowerPC status: IBM POWER processors have well-documented mitigations managed via firmware (skiboot/OPAL) and kernel runtime patches. POWER10 includes hardware mitigations for most Spectre variants. PPC32 embedded cores (e500, 440) are generally in-order and not affected by speculative execution vulnerabilities. ISLE applies mitigations based on PVR (Processor Version Register) from the device tree.

Runtime reporting (all architectures) — The Linux-compatible sysfs interface (/sys/devices/system/cpu/vulnerabilities/) is populated on all architectures with architecture-appropriate mitigation status strings.

17.6 Dual-Boot Safety

  • ISLE never modifies the existing Linux kernel installation.
  • GRUB is configured with both kernels; the default can be set by the user.
  • If ISLE fails to boot, the user selects the Linux kernel from GRUB.
  • A "last known good" mechanism records successful boots and can auto-revert.

17.7 Boot Protocol Migration Path

The boot architecture evolves through four phases, each building on the previous:

Phase 1 — Multiboot1 (current). GRUB loads the ELF via multiboot command. QEMU loads directly with -kernel. Memory map from Multiboot1 info structure. Sufficient for all kernel development and QEMU-based testing.

Phase 2 — Multiboot2 full parser. Parse Multiboot2 tags to access richer boot information: ACPI RSDP pointer, EFI memory map, framebuffer info, boot services tag. This enables ACPI table parsing and EFI runtime services without changing the bootloader. GRUB2 already supports the multiboot2 command.

Phase 3 — UEFI stub boot. Add a PE/COFF header stub to the kernel image (similar to Linux EFISTUB). UEFI firmware requires PE/COFF executables, not ELF — the stub header makes the kernel image a valid PE/COFF binary that UEFI can load directly. The actual kernel code remains ELF internally; the PE/COFF header is a thin wrapper (like Linux's header.S which embeds a PE/COFF header in the bzImage). The kernel becomes directly bootable from UEFI firmware without GRUB — efibootmgr can register it. Use EFI boot services for memory map and GOP framebuffer, then call ExitBootServices() before entering the kernel proper. systemd-boot and other UEFI-native boot managers work at this stage.

Phase 4 — Linux boot protocol. Implement the x86 Linux boot protocol (struct boot_params at 0x10000). This makes the ISLE kernel loadable by any Linux-compatible bootloader. Combined with a standard /boot layout and initramfs, this enables the drop-in package installation described in §17.3.1. This is the final production boot target.

17.8 Secure Boot and Measured Boot

Secure Boot and Measured Boot are kernel-level boot-phase concerns. They apply equally to servers (enterprise attestation, confidential computing), cloud instances (vTPM-based instance identity), and consumer devices (UEFI Secure Boot for firmware lockdown). Neither feature is consumer-specific.

17.8.1 UEFI Secure Boot

UEFI Secure Boot enforces a chain of trust starting in firmware: the UEFI db (allowed signature database) and dbx (revocation list) are stored in firmware NVRAM. Every executable in the boot path (bootloader, shim, kernel) must be signed by a key in the db.

Deployment models:

| Model | Chain | When used |
|---|---|---|
| Shim + GRUB | Microsoft UEFI CA → shim (signed by MS) → GRUB (signed by distro) → kernel (signed by distro) | Default for distros shipping via OEM |
| UEFI direct | Custom key enrolled in db → kernel PE/COFF (signed by ISLE key) | Self-managed servers, custom deployments |
| Unsigned (disabled) | No verification | Development hardware, QEMU |

ISLE requires Phase 3 (UEFI stub, §17.7) before Secure Boot can be supported. The kernel image must be a valid PE/COFF binary for UEFI to verify its signature before loading. The build system produces a signed image via sbsign --key isle-signing.key --cert isle-signing.crt isle-kernel.efi.

Kernel module signing: Once the kernel is Secure Boot-booted, all kernel modules (Tier 1 drivers) must also be signed. Unsigned modules are rejected. The module signing key is separate from the UEFI boot key. The build system embeds the module signing public key in the kernel image; drivers are signed with the corresponding private key during the build.

UEFI Secure Boot state: The kernel reads the UEFI SecureBoot variable from EFI runtime services at boot and records it in a read-only kernel parameter. Userspace can query via /sys/firmware/efi/efivars/SecureBoot-*. This affects policy decisions (e.g., CAP_SYS_MODULE behaviour).

17.8.2 Measured Boot (TPM PCR Chain)

Measured Boot extends a TPM Platform Configuration Register (PCR) with a cryptographic hash at each step of the boot chain. PCRs are append-only (extend = SHA256(current PCR value || new measurement)); they cannot be reset without rebooting. A remote attestation verifier can reconstruct the expected PCR values from the known firmware/bootloader/kernel and check that the running system matches.
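The append-only extend operation can be demonstrated in a few lines. This is a toy sketch: DefaultHasher stands in for SHA-256 purely so it runs with only the standard library — a real TPM uses SHA-256 and 32-byte PCRs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy stand-in for SHA-256 (NOT cryptographic) so the chaining
/// property is runnable with only the standard library.
fn toy_hash(data: &[u8]) -> [u8; 8] {
    let mut h = DefaultHasher::new();
    h.write(data);
    h.finish().to_be_bytes()
}

/// A PCR starts at all zeroes and is append-only:
/// extend(m) sets PCR = H(current PCR value || m).
struct Pcr([u8; 8]);

impl Pcr {
    fn new() -> Self {
        Pcr([0; 8])
    }

    fn extend(&mut self, measurement: &[u8]) {
        let mut buf = Vec::with_capacity(self.0.len() + measurement.len());
        buf.extend_from_slice(&self.0); // current PCR value first...
        buf.extend_from_slice(measurement); // ...then the new measurement
        self.0 = toy_hash(&buf);
    }
}
```

Because each value is hashed together with the previous one, the final PCR value commits to the entire ordered sequence of measurements: replaying the same boot chain reproduces it, while reordering or substituting any step yields a different value.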

Standard x86 PCR assignment (UEFI + Linux convention, which ISLE follows):

| PCR | What is measured |
|---|---|
| 0 | UEFI firmware code and configuration |
| 1 | UEFI firmware data (platform config) |
| 2 | Option ROM code |
| 3 | Option ROM data |
| 4 | Boot manager code (GRUB/shim) |
| 5 | Boot manager data + GPT partition table |
| 6 | Resume from hibernate |
| 7 | Secure Boot policy (db, dbx, PK, KEK state) |
| 8 | GRUB command line |
| 9 | Kernel image (bzImage/ISLE kernel PE/COFF) |
| 10 | initramfs |
| 11 | Kernel command line |
| 12–15 | Available for OS/application use |

The kernel extends PCR 9 with its own image hash during early boot (before ExitBootServices() on UEFI paths, or via GRUB's tpm module on Multiboot paths). PCR 10 is extended with the initramfs hash. PCR 11 is extended with the kernel command line.

TPM interface: The kernel accesses the TPM via the TPM CRB (Command Response Buffer, TPM 2.0 mandatory interface) or TPM TIS (legacy 1.2 / 2.0 FIFO interface). The driver is Tier 1, ACPI-probed (MSFT0101 or MSFT0200).

// isle-core/src/tpm/mod.rs

/// TPM 2.0 PCR Extend command.
/// Extends the given PCR with SHA-256(current || digest).
pub fn pcr_extend(pcr_index: u32, digest: &[u8; 32]) -> Result<(), TpmError>;

/// Read back the current value of a PCR.
pub fn pcr_read(pcr_index: u32) -> Result<[u8; 32], TpmError>;

/// Seal a secret to the current PCR state.
/// Returns a TPM2B_PUBLIC + TPM2B_PRIVATE blob.
/// The secret can only be unsealed if PCRs match the policy at seal time.
pub fn seal(pcr_policy: &PcrPolicy, secret: &[u8]) -> Result<SealedBlob, TpmError>;

/// Unseal a blob previously created by seal().
/// Fails if any PCR in the policy has changed since sealing.
pub fn unseal(blob: &SealedBlob) -> Result<Vec<u8>, TpmError>;

Disk encryption integration: seal() is the mechanism for TPM-bound disk encryption keys (equivalent to Linux's systemd-cryptenroll or clevis). The disk encryption key is sealed to a PCR policy covering PCRs 0, 4, 7, 9, 11 (firmware + Secure Boot policy + kernel + cmdline). Any modification to the boot chain (new kernel, changed cmdline, disabled Secure Boot) causes unseal to fail, prompting for a recovery passphrase.

Confidential computing intersection: On confidential VM platforms (AMD SEV-SNP, Intel TDX, ARM CCA), the TPM is replaced by a virtual TPM whose root of trust is the hardware attestation report (VCEK certificate, TD quote, Realm Attestation Token). The PCR-based measured boot model is the same; the trust root is the hardware VM isolation guarantee rather than a physical TPM chip. §47 covers the distributed/confidential computing architecture.

17.8.3 Kernel Responsibilities Summary

| Responsibility | In kernel? | Notes |
|---|---|---|
| Kernel image signing | No | Build-time: sbsign in the build system |
| Module signing verification | Yes | Enforced when Secure Boot active |
| PCR extension (kernel + cmdline) | Yes | Early boot, before driver init |
| TPM driver (CRB/TIS) | Yes | Tier 1, ACPI-probed |
| seal() / unseal() API | Yes | Exposed to userspace via ioctl |
| Key management policy | No | Userspace (systemd-cryptenroll, clevis) |
| Remote attestation protocol | No | Userspace (keylime, MAA agent) |
| Boot graphics, splash screen | No | Bootloader/compositor |
| Dual-boot chainloading | No | Bootloader (GRUB) |

18. First-Class Architectures

ISLE targets six architectures as first-class citizens. All six receive equal design consideration, CI testing, and performance optimization.

| Architecture | Status | Isolation mechanism | Notes |
|---|---|---|---|
| x86-64 | Primary dev target | Intel MPK (WRPKRU) | Most mature, widest hardware |
| aarch64 | First-class, day one | POE (ARMv8.9+) / page-table fallback | ARM servers, Apple Silicon (VM) |
| armv7 | First-class, day one | DACR memory domains | Embedded, IoT, Raspberry Pi |
| riscv64 | First-class, day one | Page-table based | Emerging server/embedded platform |
| ppc32 | First-class, day one | Segment registers / page-table based | Embedded PowerPC, AmigaOne, networking appliances |
| ppc64le | First-class, day one | HPT / Radix MMU / page-table based | POWER servers, IBM POWER8/9/10, Raptor Talos II |

Architecture-Specific Code

Architecture-specific code is isolated under arch/ and isle-core/src/arch/:

  • Boot code: Rust and assembly, per-architecture
  • Syscall entry/exit: Assembly stubs
  • Context switch: Assembly (register save/restore)
  • Interrupt dispatch: Assembly stubs into Rust handlers
  • vDSO: Per-architecture user-accessible pages
  • MPK / isolation primitives: Abstracted behind a common IsolationDomain trait

Per-architecture hardware abstraction equivalents:

| Concept | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|
| Privilege separation | GDT (ring 0/3 segments) | Exception levels (EL0/EL1) | Processor modes (USR/SVC) | Privilege levels (U/S) | MSR PR bit (user/supervisor) | MSR PR bit (user/supervisor) |
| Exception dispatch | IDT (256 gate descriptors) | Exception vector table (VBAR_EL1, 16 vectors: 4 types × 4 source states) | Vector table (VBAR, 8 entries) | Trap vector (stvec, single entry + scause dispatch) | Exception vector table (IVPR + IVORn) | System Reset + Machine Check vectors (LPCR) |
| Interrupt controller | APIC (LAPIC + IOAPIC) | GIC v2/v3 (distributor + redistributor/CPU interface, detected at runtime) | GIC (distributor + CPU interface) | PLIC (+ CLINT for timer/IPI) | OpenPIC / MPIC | XICS / XIVE (POWER8/9/10) |
| Timer | APIC timer / HPET / TSC | Generic Timer (CNTPCT_EL0) | Generic Timer (CNTPCT) | SBI timer ecall / mtime | Decrementer (DEC SPR) | Decrementer (DEC SPR) / HDEC |
| Syscall mechanism | SYSCALL/SYSRET (MSRs) | SVC instruction (EL0→EL1) | SVC instruction (USR→SVC) | ecall instruction (U→S) | sc instruction | sc instruction / scv (POWER9+) |
| Page table format | 4-level (PML4→PDPT→PD→PT) | 4-level (L0→L1→L2→L3) | 2-level (L1→L2, 1 MB sections) | 4-level Sv48 | 2-level (PGD→PTE, 4 KB pages) | Radix tree (POWER9+) or HPT (hashed page table) |
| Fast isolation | MPK (WRPKRU) | POE (POR_EL0) / MTE | DACR (16 domains) | Page-table based | Segment registers (16 segments) | Radix partition table / HPT LPAR |
| TLB ID | PCID (12-bit, CR3) | ASID (8/16-bit, TTBR) | ASID (8-bit, CONTEXTIDR) | ASID (9–16 bit, satp) | PID (8-bit, via PID SPR) | PID/LPID (Radix: 20-bit PID, LPIDR) |

Everything else — scheduling, memory management, capability system, driver model, syscall compatibility — is architecture-independent Rust code.

No 32-bit Compatibility Modes on 64-bit Kernels

ISLE does not support running 32-bit binaries on 64-bit kernels:

  • No ia32 compatibility mode on x86-64
  • No AArch32 compatibility mode on AArch64
  • No RV32 compatibility mode on RV64

ARMv7 (32-bit ARM) is supported as a native first-class architecture — it runs a native 32-bit kernel, not a compatibility layer on a 64-bit kernel. This follows the principle that 32-bit support, where needed, is added as a separate target rather than as a compatibility layer that doubles the syscall surface.

Advanced Feature Architecture Parity

Parts XI–XIV define advanced features that rely on architecture-specific hardware mechanisms. The following matrix summarizes support status across all six first-class architectures. Where hardware is unavailable, ISLE either provides a software fallback (reduced performance) or marks the feature as not supported on that architecture. The kernel's #[cfg(target_feature)] mechanism ensures unsupported paths compile to no-ops with zero overhead.

| Feature | Mechanism | x86-64 | AArch64 | ARMv7 | RISC-V 64 | PPC32 | PPC64LE |
|---|---|---|---|---|---|---|---|
| Fast driver isolation | MPK/POE/DACR/page-table | WRPKRU (native) | POE (ARMv8.9+, POR_EL0) / page-table fallback | DACR 16 domains | Page-table based | Segment registers (16 segments) / page-table fallback | Radix partition table / HPT LPAR |
| Memory tagging | MTE/LAM | Intel LAM (pointer tagging only) | MTE (full, ARMv8.5+) | Not available | Not available | Not available | Not available |
| Hardware power metering | RAPL/SCMI/SBI | RAPL (native) | SCMI power domain | SCMI (limited) | SBI PMU (basic) / software estimation | Not available (software only) | OPAL/OCC power sensors (POWER8/9/10) |
| Confidential computing | SEV-SNP/TDX/CCA/CoVE | SEV-SNP + TDX (native) | ARM CCA (emerging) | Not available | RISC-V CoVE (draft) | Not available | Ultravisor Protected Execution Facility (POWER9+) |
| Cache partitioning | CAT/MPAM | Intel CAT + MBA (native) | ARM MPAM (ARMv8.4+) | Not available | Not available (software only) | Not available | Not available (software only) |
| Hardware preemption (GPU) | Device-dependent | Yes (vendor support) | Yes (Mali, Adreno) | Limited | Emerging | Not available | Limited (Nvidia via PCIe) |
| CXL memory pooling | CXL 2.0/3.0 | Native (PCIe 5.0+) | Emerging (ARMv9 + CXL) | Not available | Not available | Not available | OpenCAPI / CXL (POWER10+) |
| In-kernel inference | ISA extensions | AMX (matrix), AVX-512 | SME (matrix), SVE (vector) | NEON (vector) | V extension (vector) | AltiVec/SPE (limited) | VSX (vector-scalar, POWER7+) |

Reading the table: "Native" means hardware support is available and ISLE uses it directly. "Fallback" means ISLE implements the feature using a slower mechanism (typically page-table manipulation). "Not available" means neither hardware nor a practical software fallback exists — the feature is compile-time disabled on that architecture. "Emerging" or "draft" means the hardware specification exists but is not yet widely deployed; ISLE includes provisional support gated behind a feature flag.


19. Hardware Memory Safety

19.1 ARM MTE (Memory Tagging Extension)

ARM MTE ships in ARMv8.5+ silicon. MTE availability depends on both the core IP implementing the extension AND the SoC vendor enabling tag storage in the memory subsystem:

  • Core IP with MTE: ARM Neoverse V2, Neoverse V3 (all cores based on these designs implement the MTE extension at the microarchitectural level).
  • Mobile SoCs with MTE enabled: Google Pixel 8/9 (Tensor G3/G4, Cortex-X3/X4), MediaTek Dimensity 9300+ devices.
  • Datacenter SoC with MTE enabled: AmpereOne (the first datacenter SoC to fully enable MTE at the platform level, including tag storage in DRAM).
  • Cloud SoCs with MTE logic but NOT enabled: AWS Graviton 4 (Neoverse V2) and Google Axion (Neoverse V2) include MTE logic in the cores but their memory subsystems do not support tag storage — MTE is not usable on these platforms despite the core IP implementing it.
  • No MTE: Ampere Altra (Neoverse N1, ARMv8.2 — predates MTE entirely).

Every 16-byte memory granule carries a 4-bit tag, and the top bits of each pointer carry a matching tag. Hardware compares the two on every access; a mismatch faults. This catches use-after-free and buffer overflows in hardware at near-zero runtime cost.

Important limitation: MTE is probabilistic, not complete. 4-bit tags = 16 possible values. Adjacent slab objects may receive the same tag by random chance (probability 1/16 = 6.25%). Single-violation detection rate: ~93.75%. This is acceptable for defense-in-depth — Rust's ownership model is the primary safety mechanism; MTE is an additional hardware layer that catches what Rust cannot (C driver bugs in Tier 1, unsafe blocks, compiler bugs). MTE is NOT a substitute for memory-safe code.
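The 93.75% figure follows directly from the 4-bit tag space: a stale pointer escapes detection only when it happens to carry the same tag as the reallocated granule, one chance in 16. A short exhaustive count over all tag pairs confirms the arithmetic:

```rust
/// Count (allocation tag, stale pointer tag) pairs whose 4-bit tags differ.
/// A differing pair means the hardware tag check catches the violation.
fn mismatching_pairs() -> u32 {
    let mut detected = 0;
    for alloc_tag in 0u8..16 {
        for stale_tag in 0u8..16 {
            if alloc_tag != stale_tag {
                detected += 1;
            }
        }
    }
    detected
}

fn main() {
    // 240 of 256 pairs mismatch: 15/16 = 93.75% single-violation detection.
    assert_eq!(mismatching_pairs(), 240);
    assert_eq!(mismatching_pairs() as f64 / 256.0, 0.9375);
}
```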

Tag Storage Requirement:

ARM MTE stores tags in storage managed by the memory controller: 4 bits per 16-byte granule. Relative to DRAM capacity, this means tag storage is sized at 3.125% of DRAM (4 bits / 128 bits = 1/32). High-performance implementations (Neoverse V2/V3, AmpereOne) typically use dedicated Tag RAM; other implementations may use reserved DRAM regions managed transparently by the memory controller. In all cases, the storage is invisible to software and managed automatically by the hardware. On SoCs without MTE support, the tagging code is compiled out (#[cfg(target_feature = "mte")]) — zero overhead, zero memory cost. MTE is only available on ARM; x86 systems are entirely unaffected.

TEE interaction: MTE tags are stored in separate physical tag RAM. For TEE-encrypted pages, tag RAM may also be encrypted. Confidential pages are allocated untagged (tag = 0); MTE checking is disabled for pages owned by a ConfidentialContext (see §26.3). Hardware encryption already prevents unauthorized access — MTE is redundant for confidential memory.

Section 12.6 already mentions MTE and Intel LAM. This section details the architectural integration.

19.2 Design: Tag-Aware Memory Allocator

// isle-core/src/mem/tagging.rs

/// Memory tagging policy (system-wide, configurable at boot).
#[repr(u32)]
pub enum TaggingPolicy {
    /// No tagging. Standard allocation. Zero overhead.
    /// Used on hardware without MTE, or for maximum performance.
    Disabled        = 0,

    /// Synchronous tagging: fault immediately on tag mismatch.
    /// Catches all tag violations. ~128 extra cycles per page allocation.
    /// Recommended for development and high-security production.
    Synchronous     = 1,

    /// Asynchronous tagging: record violations in a register, check lazily.
    /// Lower overhead (~10 cycles per allocation), but violations reported
    /// with delay. Good for production with logging.
    Asynchronous    = 2,
}

/// Tag operations for the memory allocator.
pub trait MemoryTagger {
    /// Assign a random tag to a newly allocated region.
    /// Called by: slab allocator (per-object), buddy allocator (per-page).
    fn tag_allocation(&self, addr: *mut u8, size: usize) -> TaggedPtr;

    /// Clear tags on freed memory (set to a "freed" tag value).
    /// Any subsequent access with the old tag will fault.
    fn tag_deallocation(&self, addr: *mut u8, size: usize);

    /// Set tags for a DMA buffer region (tag = 0, untagged).
    /// DMA engines don't understand tags — buffers must be untagged.
    fn untag_dma_region(&self, addr: *mut u8, size: usize);
}
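To illustrate the intended semantics of the trait, here is a userspace emulation (no real MTE: the tag is carried in pointer bits 56-59, and granule tags live in a map standing in for hardware tag RAM; `TaggedPtr` and `EmulatedTagger` as written here are illustrative, not ISLE's implementation):

```rust
use std::collections::HashMap;

const TAG_SHIFT: u32 = 56;
const TAG_MASK: u64 = 0xF << TAG_SHIFT;

/// Hypothetical tagged-pointer wrapper; ISLE's real TaggedPtr may differ.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct TaggedPtr(pub u64);

impl TaggedPtr {
    pub fn new(addr: u64, tag: u8) -> Self {
        TaggedPtr((addr & !TAG_MASK) | ((tag as u64 & 0xF) << TAG_SHIFT))
    }
    pub fn tag(self) -> u8 {
        ((self.0 & TAG_MASK) >> TAG_SHIFT) as u8
    }
    pub fn addr(self) -> u64 {
        self.0 & !TAG_MASK
    }
}

/// Emulated granule tag store (real MTE keeps this in hidden tag RAM).
pub struct EmulatedTagger {
    granule_tags: HashMap<u64, u8>,
}

impl EmulatedTagger {
    pub fn new() -> Self {
        Self { granule_tags: HashMap::new() }
    }
    /// tag_allocation: assign a tag to every 16-byte granule in the region.
    pub fn tag_allocation(&mut self, addr: u64, size: u64, tag: u8) -> TaggedPtr {
        for g in (addr..addr + size).step_by(16) {
            self.granule_tags.insert(g, tag);
        }
        TaggedPtr::new(addr, tag)
    }
    /// tag_deallocation: retag granules with the "freed" value 0xF.
    pub fn tag_deallocation(&mut self, addr: u64, size: u64) {
        for g in (addr..addr + size).step_by(16) {
            self.granule_tags.insert(g, 0xF);
        }
    }
    /// What hardware does on every load/store: compare pointer tag
    /// against the tag of the granule being accessed.
    pub fn check(&self, p: TaggedPtr) -> bool {
        self.granule_tags.get(&p.addr()) == Some(&p.tag())
    }
}

fn main() {
    let mut t = EmulatedTagger::new();
    let p = t.tag_allocation(0x1000, 32, 0x7);
    assert!(t.check(p)); // live allocation: access permitted
    t.tag_deallocation(0x1000, 32);
    assert!(!t.check(p)); // stale pointer: tag mismatch, hardware would fault
}
```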

19.3 Integration Points

Slab allocator (Section 12.2):
  Object allocation:
    1. Allocate object from slab (existing path).
    2. Assign random 4-bit tag to the object's 16-byte granules.
    3. Return tagged pointer (tag in top bits).
  Object deallocation:
    1. Return object to slab (existing path).
    2. Set the object's granules to a "freed" tag (e.g., 0xF).
    3. Any subsequent access with the old tag faults immediately.

  Benefit: use-after-free in kernel (or in Tier 1 C drivers) is caught
  by hardware. The fault is caught by domain isolation and triggers driver crash recovery.

Page allocator (Section 12.1):
  Page allocation: tag all 256 granules in the page with a fresh tag.
  Page deallocation: tag all granules with "freed" tag.
  Cost: 256 STG instructions per page alloc/dealloc (or 128 ST2G instructions,
  each tagging two 16-byte granules).
  At ~1 cycle per STG: 256 cycles (128 cycles with ST2G). Page alloc is ~300+ cycles.
  Overhead: ~40-85% of page allocator hot path (when enabled; lower with ST2G).

  Note: this only affects ARM. On x86 without MTE, zero overhead.
  On ARM without MTE enabled, zero overhead (policy = Disabled).

KABI boundary:
  When kernel passes a buffer to a Tier 1 driver:
    Buffer is tagged. Driver receives tagged pointer.
    If driver overflows the buffer: tag mismatch, hardware fault.
    Domain isolation catches the fault, driver is crash-recovered.
  This provides hardware-enforced bounds checking for C drivers,
  even though the kernel is written in Rust (which checks bounds in software).

DMA buffers:
  DMA engines cannot process tagged memory.
  DMA buffers are allocated untagged (tag = 0).
  IOMMU validates DMA addresses regardless.

fork() / CoW:
  Before CoW break: child shares parent's page (same tags, read-only).
  On CoW break (child or parent writes):
    1. Allocate new page, copy data.
    2. Assign FRESH RANDOM tags to the new page's granules.
    3. Do NOT copy the old page's tags.
    Rationale: if both pages kept the same tags, a stale pointer from
    one process could access the other's now-separate page without
    a tag fault (same tag, different physical page). Fresh tags ensure
    that cross-process stale pointers are detected by MTE.

19.4 Intel LAM (Linear Address Masking)

Intel LAM allows using top bits of 64-bit pointers for metadata without them being treated as part of the address. This is less powerful than MTE (no hardware tag checking), but useful for:

  • Pointer authentication (storing metadata in unused address bits)
  • Memory safety tooling (KASAN-like in-kernel detection)
  • Capability tagging (embedding capability metadata in pointers)

LAM modes:
  LAM_U48: bits 62:48 available for metadata (15 bits, user pointers only).
  LAM_U57: bits 62:57 available for metadata (6 bits, 5-level paging mode).

  Controlled via CR3 flags: CR3.LAM_U48 or CR3.LAM_U57.
  No runtime cost: address masking is performed by hardware in the MMU pipeline.
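A software illustration of the masking semantics (LAM itself performs this in the MMU pipeline; the constants below mirror LAM_U57's 6-bit field for user pointers, and the helper names are hypothetical):

```rust
// Bits 62:57 carry metadata under LAM_U57; the hardware ignores them
// when translating the address. Constants illustrative only.
const META_SHIFT: u32 = 57;
const META_BITS: u64 = 0x3F; // 6 metadata bits (62:57)

/// Embed 6 bits of metadata into a canonical user pointer.
fn embed(addr: u64, meta: u8) -> u64 {
    (addr & !(META_BITS << META_SHIFT)) | (((meta as u64) & META_BITS) << META_SHIFT)
}

/// Recover the metadata from a tagged pointer.
fn metadata(ptr: u64) -> u8 {
    ((ptr >> META_SHIFT) & META_BITS) as u8
}

/// What LAM hardware does before translation: strip the metadata bits
/// (for user pointers, bit 63 is 0, so clearing them restores canonical form).
fn lam_mask(ptr: u64) -> u64 {
    ptr & !(META_BITS << META_SHIFT)
}

fn main() {
    let addr = 0x0000_7f12_3456_7000u64;
    let p = embed(addr, 0x2A);
    assert_eq!(metadata(p), 0x2A); // metadata survives in the pointer
    assert_eq!(lam_mask(p), addr); // but the effective address is unchanged
}
```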

Comparison with MTE:
  MTE (ARM):  4-bit tag per 16-byte granule. Hardware CHECKS on every access.
              Detects use-after-free, buffer overflow at runtime. ~128 cycles per
              page allocation for tag setup. Zero-cost access checks (pipelined).
  LAM (x86): 6-15 metadata bits per pointer. NO hardware checking — metadata is
              simply ignored by the MMU. Software must perform its own checks.
              Zero overhead. Useful for tooling metadata, not for runtime safety.

  Result: MTE provides stronger guarantees (hardware-enforced); LAM provides
  more flexible metadata embedding. ISLE uses both where available.

Integration: the memory allocator stores metadata in LAM bits. Debug builds use these bits for KASAN-equivalent checking. Release builds can optionally use them for capability hints.

Security caveat: Intel LAM has been disabled in the Linux kernel since v6.12 due to the SLAM attack (Spectre-based exploitation of LAM metadata bits without LASS protection). ISLE does not enable LAM unless LASS (Linear Address Space Separation) is also available on the CPU. On CPUs without LASS, the upper address bits described above are not used for metadata; KASAN-equivalent checking uses shadow memory instead. When both LAM and LASS are present, LAM is enabled with the protections described above.

19.4.1 AArch64 Pointer Authentication (PAC)

AArch64 provides Pointer Authentication Codes (PAC, ARMv8.3+) as a complementary mechanism to MTE. PAC signs pointers with a cryptographic MAC using a per-process key, detecting pointer forgery and corruption:

PAC in ISLE:
  - Return address signing: PACIASP/AUTIASP in function prologue/epilogue.
    Compiler-inserted via -mbranch-protection=pac-ret+leaf.
  - Detects ROP (Return-Oriented Programming) attacks: corrupted return
    addresses fail authentication and trap.
  - Cost: ~1 cycle per PAC/AUT instruction (pipelined). Zero memory overhead.
  - Available on: Apple M1+, AWS Graviton 3+, Cortex-A710+.

  ISLE enables PAC for all kernel code on capable hardware. This is orthogonal
  to MTE (MTE detects memory safety bugs; PAC detects control-flow hijacking).
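As a toy illustration of the sign/authenticate flow (real PAC computes the MAC with the QARMA block cipher in hardware, keyed per process; the hash below is a deterministic stand-in, and all names and constants are illustrative):

```rust
/// NOT cryptographic: FNV-style mixing standing in for QARMA, just so
/// the sign/auth round-trip is deterministic for this demo.
fn toy_mac(key: u64, sp: u64, addr: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for v in [key, sp, addr] {
        h = (h ^ v).wrapping_mul(0x100_0000_01b3);
    }
    h >> 48 // 16-bit "PAC" stored in the pointer's upper bits
}

/// PACIASP analogue: sign the return address with key + SP modifier.
fn pac_sign(key: u64, sp: u64, ret: u64) -> u64 {
    ret | (toy_mac(key, sp, ret) << 48)
}

/// AUTIASP analogue: Some(address) if authentication succeeds,
/// None if the pointer was tampered with (hardware traps instead).
fn pac_auth(key: u64, sp: u64, signed: u64) -> Option<u64> {
    let addr = signed & 0x0000_FFFF_FFFF_FFFF;
    if (signed >> 48) == toy_mac(key, sp, addr) {
        Some(addr)
    } else {
        None
    }
}

fn main() {
    let (key, sp, ret) = (0xDEAD_BEEF_1234_5678u64, 0x7FFF_F000u64, 0x40_1A2Cu64);
    let signed = pac_sign(key, sp, ret);
    assert_eq!(pac_auth(key, sp, signed), Some(ret)); // intact: authenticates
    // Tampering with the signed pointer (here: flipping a PAC bit) fails
    // authentication; hardware would trap on the AUT instruction.
    let corrupted = signed ^ (1u64 << 55);
    assert_eq!(pac_auth(key, sp, corrupted), None);
}
```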

19.5 CHERI (Future)

ARM Morello (CHERI prototype) demonstrates hardware-capability pointers with bounds checking. CHERI pointers are 128-bit: address (64) + bounds (32) + permissions (16) + flags (16). Every pointer carries its own bounds and permission information. Hardware checks on every dereference.

ISLE's capability system (Section 11.1) is a software capability model. CHERI provides a hardware capability model. When CHERI hardware is available:

Software capabilities (current):
  Kernel maintains capability table. Validated on syscall.
  Overhead: ~5-10 cycles per capability check (bitmask test).

CHERI hardware capabilities (future):
  Pointer IS the capability. Hardware validates on every access.
  Overhead: 0 cycles (pipelined with memory access).

  ISLE's capability tokens become hardware CHERI capabilities.
  The translation is natural: both use unforgeable tokens with
  bounded permissions and delegation rules.

Design for CHERI readiness: the capability system should NOT assume that capabilities are always validated in software. The validation path should be abstractable so that CHERI hardware validation can replace software validation.
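One possible shape for that abstraction, sketched with hypothetical names — a software validator today, replaceable later by a CHERI-backed one that simply trusts the hardware check:

```rust
/// Illustrative capability record; ISLE's real token layout may differ.
#[derive(Clone, Copy)]
pub struct Capability {
    pub base: u64,
    pub len: u64,
    pub perms: u16, // bitmask: READ=1, WRITE=2, EXEC=4
}

pub const READ: u16 = 1;
pub const WRITE: u16 = 2;
pub const EXEC: u16 = 4;

/// Hypothetical validation abstraction: the syscall path calls this,
/// not a hard-coded software check.
pub trait CapabilityValidator {
    /// Does `cap` permit `requested` access to [addr, addr+len)?
    fn validate(&self, cap: &Capability, addr: u64, len: u64, requested: u16) -> bool;
}

/// Current path: kernel validates in software (~5-10 cycle bitmask test).
/// A CHERI implementation would return true unconditionally here, because
/// the load/store pipeline enforces bounds and permissions itself.
pub struct SoftwareValidator;

impl CapabilityValidator for SoftwareValidator {
    fn validate(&self, cap: &Capability, addr: u64, len: u64, requested: u16) -> bool {
        let in_bounds = addr >= cap.base
            && addr
                .checked_add(len)
                .map_or(false, |end| end <= cap.base + cap.len);
        let allowed = (requested & !cap.perms) == 0;
        in_bounds && allowed
    }
}

fn main() {
    let v = SoftwareValidator;
    let cap = Capability { base: 0x4000, len: 0x1000, perms: READ | WRITE };
    assert!(v.validate(&cap, 0x4000, 0x100, READ)); // in bounds, permitted
    assert!(!v.validate(&cap, 0x4F80, 0x100, READ)); // overruns the bounds
    assert!(!v.validate(&cap, 0x4000, 0x10, EXEC)); // EXEC not granted
}
```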

CHERI Morello Status:

ARM Morello evaluation boards shipped in 2022 (based on Neoverse N1 + CHERI extensions). As of 2026, production CHERI hardware is not available. The CHERI readiness design above prepares for future hardware without depending on it. When production CHERI SoCs ship, the capability validation abstraction layer enables a transition from software to hardware capability checks.

19.6 Performance Impact

MTE on ARM (when enabled): ~128 cycles per page allocation (~40% of allocator hot path). Memory access checks are hardware-pipelined: zero overhead. Linux pays the same cost when MTE is enabled.

MTE disabled (default on x86, optional on ARM): zero overhead. No code runs.

Intel LAM: zero runtime overhead (address masking is free in hardware).

CHERI (future): zero overhead (hardware-pipelined capability checks).

19.7 Hardware Fault Handler Constraints

Hardware fault handlers (machine check exceptions, bus errors, SError, NMI, system error interrupts) operate in extremely constrained contexts where normal kernel operations are forbidden. Violating these constraints causes deadlock, system hang, or recursive faults.

19.7.1 Fault Handler Categories

Hardware fault handlers fall into three categories with progressively stricter constraints:

| Category | Examples | Context | Permitted Operations |
|---|---|---|---|
| Maskable interrupts | Timer tick, device IRQ | IRQ context, interrupts disabled | Try-lock, lock-free writes, deferred work |
| Synchronous faults | Page fault, alignment fault, breakpoint | Fault context, preemptible | Blocking locks (with care), allocation (with care) |
| Non-maskable faults | Machine Check (MCE), NMI, SError, Bus Error, System Reset | NMI context, all interrupts blocked | Lock-free only, per-CPU buffers, no locks |

The critical distinction: maskable interrupts can be delayed by disabling interrupts, but non-maskable faults fire regardless of interrupt state. Code holding a spinlock cannot prevent an MCE or NMI from occurring.

19.7.2 Non-Maskable Fault Handler Requirements

Non-maskable fault handlers (MCE, NMI, SError, Bus Error, System Reset vectors) MUST follow these rules:

1. No blocking operations. The handler MUST NOT:

  • Acquire a spinlock with blocking semantics (lock() / spin_lock())
  • Acquire a mutex, rwlock, or semaphore
  • Allocate memory (kmalloc, vmalloc, page allocation)
  • Sleep or yield (schedule(), wait(), condvar)
  • Perform I/O that may block (disk, network)
  • Call any function that may transitively do the above

Rationale: The fault may have interrupted code already holding locks. If the handler blocks waiting for the same lock, deadlock occurs immediately.

2. Try-lock only, with fallback. If the handler needs a lock, it MUST use try-lock (try_lock() / spin_trylock()) and handle failure:

match lock.try_lock() {
    Some(guard) => {
        // Critical section; the guard releases the lock when dropped.
        drop(guard);
    }
    None => {
        // Fallback: cannot acquire the lock.
        // Options: log to a per-CPU buffer and continue, force reboot,
        // or degrade gracefully.
    }
}

3. Per-CPU buffers for logging. NMI/MCE handlers MUST NOT write to shared ring buffers (MPSC, printk). Instead, use a pre-allocated per-CPU buffer:

// Allocated at boot, one per CPU, never freed
struct MceLog {
    head: AtomicUsize,
    entries: [MceLogEntry; 64],
}

static MCE_LOG: PerCpu<MceLog> = PerCpu::new(MceLog::EMPTY);

// In MCE handler (NMI context):
fn mce_handler(ctx: &MceContext) {
    let log = MCE_LOG.this_cpu();
    let idx = log.head.fetch_add(1, Relaxed) % 64;
    log.entries[idx] = MceLogEntry::from_ctx(ctx);
    // Handler returns; main kernel drains the log later
}

The main kernel drains these buffers after returning from the exception, outside NMI context.

4. No locks at all for NMI. NMI handlers specifically MUST NOT use any locks, even try-lock. The NMI can nest inside an MCE handler that already holds the lock, causing deadlock. NMI handlers use only:

  • Per-CPU variables (no sharing)
  • Lock-free atomic operations (atomic read/write, compare-and-swap)
  • Pre-mapped memory (no page faults possible)

5. Pre-allocated resources. All memory, buffers, and stacks used by NMI/MCE handlers MUST be allocated at boot time. Allocation during handler execution is forbidden. On x86-64, MCE handlers run on a dedicated IST (Interrupt Stack Table) stack, pre-allocated and never paged.

19.7.3 Deferred Recovery Actions

Any recovery action that might block MUST be deferred to a workqueue or tasklet:

MCE handler (NMI context):
  1. Capture fault context to per-CPU buffer (lock-free)
  2. Assess severity: recoverable vs. fatal
  3. If recoverable:
     a. Log to per-CPU buffer
     b. Set flag: NEEDS_RECOVERY = true
     c. Return from exception
  4. If fatal:
     a. Log to per-CPU buffer
     b. Trigger immediate reboot (no locking)

Workqueue (thread context, after NMI returns):
  1. Check NEEDS_RECOVERY flag
  2. If set:
     a. Drain per-CPU MCE log to kernel log (may block)
     b. Initiate memory offlining (may block)
     c. Notify userspace via netlink (may block)
     d. Clear NEEDS_RECOVERY flag

The workqueue runs in normal thread context where blocking operations are safe. The NMI handler does the minimum work needed to capture state and flag the need for recovery.
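The flag-and-defer handoff can be sketched in a few lines (userspace illustration; `NEEDS_RECOVERY` mirrors the flag named above, and the per-CPU log drain and memory offlining are elided):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// The NMI-context handler only sets an atomic flag; the workqueue,
// running in normal thread context, does all the blocking work.
static NEEDS_RECOVERY: AtomicBool = AtomicBool::new(false);

/// NMI context: lock-free, no allocation, no blocking.
fn mce_handler_fast_path() {
    // ...capture fault context to a per-CPU buffer (elided)...
    NEEDS_RECOVERY.store(true, Ordering::Release);
}

/// Thread context: safe to block. Returns true if recovery work ran.
fn recovery_workqueue() -> bool {
    // swap() atomically consumes the flag, so the (possibly expensive)
    // recovery path runs at most once per recorded fault.
    if NEEDS_RECOVERY.swap(false, Ordering::Acquire) {
        // ...drain per-CPU logs, offline memory, notify userspace (may block)...
        true
    } else {
        false
    }
}

fn main() {
    assert!(!recovery_workqueue()); // nothing pending
    mce_handler_fast_path(); // fault occurs in NMI context
    assert!(recovery_workqueue()); // workqueue performs deferred recovery
    assert!(!recovery_workqueue()); // flag consumed; no duplicate work
}
```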

19.7.4 Architecture-Specific Fault Types

| Architecture | Non-Maskable Fault Types | Vector / Entry Point |
|---|---|---|
| x86-64 | Machine Check Exception (#MC), NMI | IDT vector 18 (MCE), vector 2 (NMI) |
| AArch64 | SError Interrupt, Physical IRQ (FIQ) | VBAR_EL1 offset 0x180 (SError) |
| ARMv7 | Data Abort (imprecise), FIQ | VBAR offset 0x1C (FIQ), 0x10 (Data Abort) |
| RISC-V 64 | NMI (platform-specific) | Platform-defined; often traps to mtvec in M-mode |
| PPC32 | Machine Check, Critical Interrupt | IVOR[10] (MCE), IVOR[15] (Critical) |
| PPC64LE | Machine Check, System Reset | HSRR0/HSRR1 vectors, LPCR-defined |

All handlers for these vectors MUST follow the non-maskable fault handler requirements in Section 19.7.2.

19.7.5 Recursive Fault Prevention

Hardware fault handlers MUST prevent recursive faults:

1. Guard pages. Handler stacks have guard pages (unmapped) at both ends. Stack overflow causes an immediate fault rather than corrupting adjacent memory.

2. Handler re-entry detection. Each handler checks a per-CPU flag on entry:

fn mce_handler(ctx: &MceContext) {
    if MCE_IN_PROGRESS.this_cpu().swap(true, Relaxed) {
        // Already in MCE handler — recursive fault
        // Cannot log (might fault again), cannot recover
        // Immediate halt to prevent infinite recursion
        arch::halt_loop();
    }
    // ... normal handler logic ...
    MCE_IN_PROGRESS.this_cpu().store(false, Release);
}

3. Pre-pinned code. Handler code and data pages are pinned in memory (never paged out). A page fault during NMI/MCE handling would cause a double fault.