Sections 37–38 of the ISLE Architecture. For the full table of contents, see README.md.
Part IX: Virtualization
Host-guest integration and suspend/resume.
37. Host and Guest Integration
How ISLE behaves as a VM host (via isle-kvm) and as a guest kernel running inside a hypervisor. This section covers virtio device negotiation, paravirtual optimizations, vhost data plane, and live migration.
Guest Mode — Virtio Device Negotiation
When ISLE runs as a guest kernel, it discovers virtio devices via PCI or MMIO transport and negotiates feature bits with the hypervisor. The virtio drivers (virtio-blk, virtio-net, virtio-gpu, virtio-console — already listed as Priority 1 in Section 7) implement the standard virtio 1.2 specification (approved as an OASIS Committee Specification in July 2022), with forward-compatible support for virtio 1.3 features as that draft is finalized.
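The negotiation itself is a bitwise intersection: the driver accepts the subset of device-offered feature bits it understands and refuses legacy-only devices. A minimal sketch in Python — the bit positions come from the virtio specification, but `negotiate` is an illustrative helper, not ISLE's actual API:

```python
# Feature bit positions from the virtio specification.
VIRTIO_F_VERSION_1     = 1 << 32  # device conforms to virtio 1.x
VIRTIO_F_RING_PACKED   = 1 << 34  # packed virtqueue layout
VIRTIO_NET_F_MRG_RXBUF = 1 << 15  # mergeable receive buffers (net-specific)

def negotiate(device_features: int, driver_features: int) -> int:
    """Accept the intersection of offered and supported feature bits."""
    accepted = device_features & driver_features
    if not accepted & VIRTIO_F_VERSION_1:
        # A modern driver refuses legacy-only devices.
        raise RuntimeError("device does not offer VIRTIO_F_VERSION_1")
    return accepted
```

In the real transport, the accepted set is written back to the device, which then confirms via the FEATURES_OK status bit.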
Guest Mode — Paravirtual Clock
Hardware RDTSC inside a VM can be inaccurate (the TSC may not be invariant, or vmexit
overhead distorts time). A paravirtual clock avoids this:
- KVM pvclock / kvmclock: the hypervisor maps a shared memory page containing clock parameters (scale, offset, version). The guest reads time from this page — no vmexit required. ISLE's clocksource subsystem auto-detects and prefers pvclock when running as a KVM guest.
- Hyper-V TSC page: equivalent mechanism for Hyper-V hosts. Same principle — shared memory page, no hypercall for time reads.
- Fallback: if neither paravirt clock is available, ISLE uses the ACPI PM timer (slow to read but reliable) or the PIT (ancient but universal).
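The pvclock read path can be sketched as a seqlock-style version check around a scaled-TSC computation, mirroring the kvmclock shared-page layout. Field names follow the KVM pvclock structure; this is an illustrative model, not ISLE's driver:

```python
def pvclock_read(page: dict, rdtsc) -> int:
    """Read guest time (ns) from a pvclock shared page without a vmexit.

    The hypervisor bumps `version` to an odd value while updating the page;
    the reader retries until it observes a stable even version.
    """
    while True:
        v1 = page["version"]
        if v1 & 1:
            continue  # hypervisor update in progress
        delta = rdtsc() - page["tsc_timestamp"]
        shift = page["tsc_shift"]
        delta = delta << shift if shift >= 0 else delta >> -shift
        ns = (delta * page["tsc_to_system_mul"]) >> 32
        t = page["system_time"] + ns
        if page["version"] == v1:
            return t  # no concurrent update: the sample is consistent
```

With `tsc_to_system_mul = 1 << 32` and `tsc_shift = 0` the TSC counts nanoseconds directly, so 50 ticks past `tsc_timestamp` adds 50 ns to `system_time`.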
Guest Mode — Balloon Driver
virtio-balloon enables dynamic memory adjustment — the hypervisor can reclaim guest
memory by inflating the balloon (guest returns pages) or release memory by deflating it.
ISLE integrates balloon inflation with its memory pressure framework:
- Balloon inflation is treated as memory pressure, triggering the same reclaim path
as physical memory exhaustion (page cache eviction, slab shrinking, swap-out)
- Balloon deflation immediately makes pages available to the buddy allocator
- This unified pressure model means ISLE's OOM decisions correctly account for
ballooned-away memory
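The unified pressure model can be illustrated with a toy accounting sketch — class and method names are hypothetical; the real reclaim path is the one described above:

```python
class BalloonAccounting:
    """Toy model of balloon <-> memory-pressure integration."""

    def __init__(self, total_pages: int, free_pages: int):
        self.total, self.free, self.ballooned = total_pages, free_pages, 0

    def reclaim(self, want: int):
        # Stand-in for page-cache eviction, slab shrinking, swap-out.
        self.free += want

    def inflate(self, n: int):
        # Balloon demand is treated exactly like physical memory pressure.
        if self.free < n:
            self.reclaim(n - self.free)
        self.free -= n
        self.ballooned += n

    def deflate(self, n: int):
        n = min(n, self.ballooned)
        self.ballooned -= n
        self.free += n  # straight back to the buddy allocator

    def oom_visible_pages(self) -> int:
        # OOM decisions must not count ballooned-away memory.
        return self.total - self.ballooned
```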
Guest Mode — PV Spinlocks
Under overcommitted VMs, spinning on a lock held by a descheduled vCPU wastes host
CPU cycles (the spinning vCPU can never acquire the lock until the holder is scheduled).
ISLE detects the hypervisor type at boot:
- KVM: replaces spin loops with KVM_HC_KICK_CPU hypercall — yields the vCPU and
requests the host to schedule the lock holder
- Hyper-V: uses HvCallNotifyLongSpinWait hypercall — notifies the hypervisor
of a long spin wait, allowing it to schedule the lock holder
- Bare metal: standard spin loops (no overhead when not virtualized)
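The slowpath has the same shape on both hypervisors: spin for a bounded budget, then hand the problem to the host. A hedged sketch — the threshold value and callback names are illustrative; KVM_HC_KICK_CPU is the real hypercall the callback stands in for:

```python
SPIN_THRESHOLD = 1 << 10  # illustrative iteration budget before yielding

def pv_spin_lock(try_acquire, kick_holder):
    """Spin briefly; past the budget, assume the holder vCPU is descheduled
    and ask the hypervisor to run it instead of burning host CPU."""
    spins = 0
    while not try_acquire():
        spins += 1
        if spins >= SPIN_THRESHOLD:
            kick_holder()  # stand-in for the KVM_HC_KICK_CPU hypercall
            spins = 0
```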
Guest Mode — Hypervisor-Specific Backends
| Hypervisor | Paravirt Features |
|---|---|
| KVM (primary) | pvclock, PV spinlocks, PV TLB flush, steal time accounting, async PF |
| Hyper-V | Synthetic interrupts, synthetic timer, APIC assist, TSC page, PV spinlocks |
| Xen PV | Future — xenbus, grant tables, PV disk/net (lower priority) |
Guest Mode — Cloud Metadata
Cloud-init and instance metadata (AWS IMDSv2, Azure IMDS, GCP metadata server) are
consumed by userspace agents. The kernel's role is providing transport:
- vsock (virtio-vsock) for hypervisor↔guest communication without networking
- virtio-serial for structured host↔guest channels
- Standard networking for HTTP-based metadata endpoints (169.254.169.254)
vhost Kernel Data Plane
vhost moves the virtio data plane into the host kernel, bypassing the VMM (QEMU) for hot-path I/O:
- vhost-net: kernel-side virtio-net processing. Packets move directly between the guest's virtio ring and the host's tap/macvtap device via kernel. The VMM handles only control plane (device configuration, feature negotiation). Implemented as a Tier 1 isle-kvm module.
- vhost-scsi: kernel-side virtio-scsi processing for direct block device access from guests, bypassing QEMU's I/O path. Guests see near-native block device performance.
- vhost-user: protocol for offloading vhost processing to userspace daemons (DPDK for networking, SPDK for storage). This is handled entirely in userspace by the VMM (e.g., QEMU), which shares guest memory via memfd with the backend daemon. The ISLE kernel does not implement vhost-user directly; it simply provides the standard shared memory and Unix domain socket primitives required for QEMU to function.
- vhost-vsock: host↔guest communication channel using the vsock address family. No networking stack required — communication uses a simple stream/datagram protocol over shared memory.
VM Live Migration (KVM)
Live migration moves a running VM from one physical host to another with minimal downtime. ISLE's isle-kvm implements the full migration pipeline:
- Pre-copy phase: Track dirty pages via Intel PML (Page Modification Logging) or manual dirty bitmap scanning. isle-kvm reads the PML buffer on a timer interrupt and transmits dirty pages to the destination host.
- Iterative convergence: Multiple pre-copy rounds, each sending pages dirtied since the last round. Configurable maximum downtime target (e.g., 50ms).
- Auto-converge: If the guest's dirty rate exceeds the network transfer rate (migration won't converge), isle-kvm throttles vCPU execution to reduce the dirty rate. This is automatic and transparent to the guest.
- Stop-and-copy: When the remaining dirty set is small enough to transfer within the downtime target, the VM is paused, final dirty pages are sent, and the destination resumes execution.
- Post-copy (optional): The destination VM starts running immediately. Pages not yet transferred are faulted in on demand via userfaultfd. ISLE integrates this with its page fault handler (Section 12) — a post-copy page fault is handled like any other demand fault, but the backing store is the source host instead of disk.
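The convergence decision in each pre-copy round reduces to simple arithmetic: can the remaining dirty set be sent within the downtime budget, and is the guest dirtying memory faster than the link can carry it? A sketch, with illustrative parameter names:

```python
def precopy_decision(dirty_pages: int, page_size: int, bw_bytes_per_s: float,
                     downtime_target_s: float, dirty_pages_per_s: float) -> str:
    """One pre-copy round's decision: stop, throttle, or keep iterating."""
    remaining_s = dirty_pages * page_size / bw_bytes_per_s
    if remaining_s <= downtime_target_s:
        return "stop-and-copy"   # final dirty set fits in the downtime budget
    if dirty_pages_per_s * page_size >= bw_bytes_per_s:
        return "throttle"        # auto-converge: dirty rate outruns the link
    return "iterate"             # another pre-copy round will shrink the set
```

For example, at 1 GiB/s a 50 ms budget covers roughly 13,000 4 KiB pages as the largest final dirty set that still meets the target.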
Guest-side migration support — When ISLE runs as a guest:
- PV migration notifier: guest receives a pre-migration hint via virtio, allowing it to flush caches, pause background I/O, and prepare for the brief freeze
- Post-migration re-enumeration: guest re-enumerates PCI topology (in case of heterogeneous migration to different hardware), re-calibrates pvclock, resumes I/O
- Confidential VM migration: handled by the TEE framework (Section 26.4)
Host Mode — Cloud Orchestration
ISLE provides /dev/kvm and the associated ioctl interface, making it compatible with
the standard KVM ecosystem:
- libvirt: standard virtualization management library. Works unmodified — it talks
to /dev/kvm via standard ioctls.
- OpenStack Nova: compute driver talks to libvirt, libvirt talks to /dev/kvm.
ISLE is transparent to the orchestration layer.
- QEMU and Firecracker: both use /dev/kvm directly. Both work unmodified on ISLE.
Recovery advantage — ISLE's driver recovery provides unique benefits for virtualization:
- Host-side: if a vhost-net or vhost-scsi module crashes, ISLE recovers it in-place (Tier 1 reload). The hypervisor and guest never notice. In Linux, a vhost crash would require tearing down and re-establishing the vhost connection.
- Guest-side: if a guest running ISLE crashes a virtio driver, the driver recovers without a VM reboot. The hypervisor sees a brief pause in I/O but no reset. In Linux, a guest virtio driver crash typically requires a VM reboot.
38. Suspend and Resume
Linux problem: Suspend/resume on laptops was notoriously unreliable for years. Driver suspend/resume callbacks are fragile — one broken driver blocks the entire system.
ISLE design:
38.1 Suspend Modes
- s2idle (suspend-to-idle): Primary suspend mode. Freezes all userspace processes, puts devices into low-power states, and halts CPUs in their deepest idle state. Does not require firmware cooperation (no ACPI S3 handoff), making it more reliable than traditional suspend-to-RAM. Wake sources: any enabled interrupt (keyboard, network, RTC alarm, lid switch).
- S3 (suspend-to-RAM): Fallback for platforms where s2idle power consumption is unacceptable. CPU and device state are saved to RAM, then firmware is called to power down the platform. Requires ACPI S3 support and correct firmware implementation.
- S4 (hibernate / suspend-to-disk): Full system image is written to a swap partition or dedicated hibernate file, then the system powers off completely. On resume, the bootloader loads the kernel, which restores the saved image into memory and resumes execution. Hibernate support depends on the block I/O layer (Section 29) being available and a configured swap/hibernate target.
38.2 Device State Save/Restore Ordering
Device suspend follows the device dependency tree in leaf-to-root order (children before parents). Resume follows root-to-leaf order (parents before children, the reverse of suspend). This ensures that a child device never attempts I/O on a parent that is already suspended, and that parent buses are active before children attempt to re-initialize.
The ordering algorithm:
- Build topological order from the device dependency tree (Section 7, device registry).
- Suspend phase — traverse the tree bottom-up:
- For each device, invoke its KABI suspend() callback with a per-device timeout (default: 5 seconds, configurable via sysfs).
- If the callback does not complete within the timeout, the driver is forcibly stopped (Tier 1/Tier 2 driver recovery — the driver's isolation domain is torn down). The device is marked as "failed to suspend" and will be re-initialized from scratch on resume.
- DMA engines are quiesced before their parent bus controller suspends.
- Interrupt controllers are suspended last (after all device interrupts are masked).
- Resume phase — traverse the tree top-down:
- Bus controllers and interrupt controllers are restored first.
- For each device, invoke its KABI resume() callback.
- Devices that failed to suspend are re-initialized via the standard driver probe path rather than the resume path.
- Drivers that fail to resume within the timeout are forcibly restarted, same as suspend failures.
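The ordering reduces to a post-order walk of the device tree, with resume as its exact reverse. A minimal sketch (the tree shape and device names are hypothetical):

```python
def suspend_order(children: dict, root: str) -> list:
    """Leaf-to-root (post-order) traversal: every child is suspended
    before its parent bus controller."""
    order = []
    def walk(dev):
        for child in children.get(dev, []):
            walk(child)
        order.append(dev)
    walk(root)
    return order

def resume_order(children: dict, root: str) -> list:
    # Resume is the exact reverse: parents are live before children probe.
    return suspend_order(children, root)[::-1]
```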
38.3 CPU State Save/Restore
On suspend, the kernel saves per-CPU state that is not preserved by hardware across the suspend/resume cycle:
- General-purpose registers: Saved to a per-CPU save area in the kernel BSS. On resume, the boot CPU restores its own state and brings up secondary CPUs via the normal SMP bringup path (Section 17), which re-initializes their register state.
- System registers / MSRs: Architecture-specific system register state that must be explicitly restored:
- x86_64: IA32_EFER, IA32_STAR, IA32_LSTAR, IA32_FMASK (syscall registers), IA32_PAT, IA32_KERNEL_GS_BASE, GDT/IDT/TR descriptors, CR0/CR3/CR4, debug registers (DR0-DR7), XCR0 (XSAVE state), IA32_SPEC_CTRL (Spectre mitigations)
- AArch64: SCTLR_EL1, TCR_EL1, TTBR0/TTBR1_EL1, MAIR_EL1, VBAR_EL1, TPIDR_EL1, CNTKCTL_EL1, CPACR_EL1
- RISC-V: satp, stvec, sscratch, sie, sstatus
- FPU/SIMD state: Saved via XSAVE (x86_64), STP of Q0-Q31 + FPCR/FPSR (AArch64), or architecture-specific equivalent. Eager FPU restore is used on resume — FPU state is restored immediately on all CPUs before executing any userspace or untrusted code, preventing the CVE-2018-3665 (LazyFP) speculative execution side-channel vulnerability.
38.4 Memory Handling
- s2idle and S3: RAM remains powered. No memory save/restore is needed. The kernel only needs to ensure that all dirty cache lines are flushed to RAM before the CPU enters the suspended state (WBINVD on x86_64, DC CIVAC + DSB on AArch64).
- S4 (hibernate): The kernel creates a consistent snapshot of all in-use memory pages using a two-phase freeze-and-snapshot approach:
- Freeze phase: All userspace processes are frozen (SIGSTOP equivalent). All Tier 1 and Tier 2 drivers (except the storage stack required for the hibernate target) are suspended via the suspend path (Section 38.2), which quiesces their DMA activity. After this point, no new memory modifications occur except from the snapshot code and the active storage drivers.
- Snapshot phase: With all sources of concurrent modification stopped, the kernel walks the page frame allocator's used-page bitmap. Free pages (tracked by the buddy allocator) are excluded. Each in-use page is compressed (LZ4, matching Linux's default hibernate compressor) and written to the configured hibernate target (swap partition or file). The snapshot code runs on the boot CPU only, with interrupts disabled except for the disk I/O completion interrupt.
- Integrity and authentication: The hibernate image is cryptographically
authenticated to prevent tampering:
- A SHA-256 hash of the compressed image is computed during the write.
- The hash is signed (not merely stored) using a TPM-backed key if a TPM is available (Section 23), or encrypted with a symmetric key derived from the kernel's boot secret (Section 22.3) if no TPM is present.
- On resume, the signature is verified (TPM path) or the hash is decrypted and validated (boot-secret path) before any pages are restored.
- This prevents an attacker with disk write access from substituting a malicious hibernate image, as they cannot forge valid authentication without access to the TPM or the boot secret.
- On resume, the bootloader loads a fresh kernel, which reads the hibernate image from disk, verifies the hash, and allocates intermediate safe memory (bounce frames) to hold the image. It then carefully copies the saved pages to their original physical addresses, taking care to avoid overwriting the fresh kernel's own executing code or page tables, and jumps to the restore entry point. The resumed kernel then re-initializes devices via the resume path described in Section 38.2.
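The snapshot phase's core loop — walk physical pages, skip what the buddy allocator marks free, compress what remains — can be sketched as follows. zlib stands in for LZ4 here only because LZ4 is not in the Python standard library; the text above names LZ4 as the actual compressor:

```python
import zlib  # stand-in for LZ4, the compressor the design actually uses

def snapshot(pages: list, free_bitmap: list) -> list:
    """Build the hibernate image: (pfn, compressed bytes) for in-use pages."""
    image = []
    for pfn, data in enumerate(pages):
        if free_bitmap[pfn]:
            continue  # free page: excluded, shrinking the image
        image.append((pfn, zlib.compress(data)))
    return image
```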
38.5 Timer Re-synchronization
System clocks drift or lose state during suspend. On resume, the kernel must re-synchronize all time sources:
- Read the hardware RTC (CMOS on x86, PL031 on ARM, or platform-specific RTC) to determine wall-clock time elapsed during suspend.
- Adjust the CLOCK_BOOTTIME offset by the elapsed suspend duration so that it reflects total wall-clock time since boot, including suspend. CLOCK_MONOTONIC is not adjusted — it does not count time spent in suspend, matching Linux semantics. CLOCK_BOOTTIME includes suspend time by definition.
- Re-calibrate TSC / arch timer: On x86, re-read IA32_TSC_ADJUST if available, or re-synchronize the TSC across CPUs via the TSC synchronization protocol (Section 16a). On AArch64, the generic timer (CNTPCT_EL0) typically survives S3 suspend but must be verified on resume. On platforms with paravirtual clocks (KVM pvclock, Hyper-V TSC page), the shared clock page is re-read to pick up updated scale/offset values.
- Fire expired timers: All pending hrtimer and timer_list entries are checked against their respective updated time bases. Timers armed against CLOCK_BOOTTIME or CLOCK_REALTIME that expired during suspend are fired immediately in a batch. Timers armed against CLOCK_MONOTONIC are evaluated against the unadjusted monotonic clock and will not fire prematurely.
- Notify userspace: A CLOCK_REALTIME discontinuity notification is sent to processes using timerfd or clock_nanosleep so they can adjust.
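The offset arithmetic is simple but easy to get backwards; a sketch of the invariant, with all values in seconds and illustrative names:

```python
def resync_clocks(boottime_offset: float, monotonic: float,
                  rtc_at_suspend: float, rtc_at_resume: float):
    """Fold the suspend duration into CLOCK_BOOTTIME's offset; leave
    CLOCK_MONOTONIC untouched (it does not count suspended time)."""
    boottime_offset += rtc_at_resume - rtc_at_suspend
    clock_boottime = monotonic + boottime_offset
    clock_monotonic = monotonic  # unchanged by design
    return boottime_offset, clock_boottime, clock_monotonic
```

A 60-second suspend therefore advances CLOCK_BOOTTIME by 60 seconds while CLOCK_MONOTONIC reads exactly what it did at suspend entry.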
38.6 Interrupt Controller State
Interrupt controller state is saved on suspend and restored before any device resume callbacks are invoked:
- x86_64 (APIC): Save and restore the Local APIC registers (LVT entries, TPR, timer configuration, spurious interrupt vector). The I/O APIC redirection table entries are saved per-pin. MSI/MSI-X vectors are re-programmed by the PCI subsystem during device resume — the interrupt controller layer saves the IRQ-to-vector mapping so that devices resume with the same interrupt vectors they had before suspend.
- AArch64 (GICv3): Save and restore the GIC Distributor (GICD), Redistributor (GICR), and CPU Interface (ICC) state. GICv3 defines standard save/restore registers for this purpose.
- RISC-V (PLIC/APLIC): Save per-source priority, per-context enable bits, and threshold registers.
The kernel disables all interrupts (except the wake source IRQ) before entering the final suspend state, and re-enables them after interrupt controller state is restored on resume.
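The final masking step can be modeled as save-mask-restore over per-IRQ enable state — a toy model; the real state lives in the APIC/GIC/PLIC registers listed above:

```python
def mask_for_suspend(irq_enabled: dict, wake_irqs: set) -> dict:
    """Disable every IRQ except the wake sources; return the saved state."""
    saved = dict(irq_enabled)
    for irq in irq_enabled:
        irq_enabled[irq] = irq in wake_irqs and saved[irq]
    return saved

def restore_after_resume(irq_enabled: dict, saved: dict):
    # Applied only after controller registers (LVT entries, redirection
    # entries, GICD/GICR state) have been restored.
    irq_enabled.update(saved)
```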