
Sections 49–54 of the ISLE Architecture. For the full table of contents, see README.md.


Part XIII: Advanced Capabilities

Power budgeting, formal verification, kernel extensibility, live evolution, intent-based management, and unified compute.

These sections cover hardware trends and OS theory advances that must be designed into ISLE from the start to avoid costly retrofits. All features satisfy three constraints:

  1. Drop-in compatibility: No feature changes the syscall surface, /proc, /sys, /dev, cgroup knobs, or ioctl behavior. Features are kernel-internal or expose new additive interfaces alongside existing Linux ones.
  2. Zero performance regression: No feature adds measurable overhead to hot paths compared to Linux on the same hardware.
  3. IP-clean: All features are based on published research, open specifications, or original design.

49. Power Budgeting

49.1 Problem

Datacenters in 2026 are power-wall limited. A rack has a fixed power budget (typically 20-40 kW). Power, not compute, is the scarce resource.

Linux has power management (cpufreq, DVFS, C-states, RAPL readout) but no power budgeting. There is no way to say "this container gets at most 150W total across CPU, GPU, memory, and NIC." There is no way for the scheduler to make holistic power-performance tradeoffs.

Relationship to Section 14.5 (Heterogeneous CPU / EAS): Section 14.5 covers Energy-Aware Scheduling at the per-task level — selecting the most energy-efficient core type (P-core vs E-core) for each task using OPP tables and PELT utilization. This section covers a complementary concern: per-cgroup power budgeting — enforcing total watt caps across all power domains (CPU + GPU + DRAM + NIC). The two mechanisms interact: EAS picks the optimal core, power budgeting enforces the envelope.

49.2 Design: Power as a Schedulable Resource

Power joins CPU time, memory, and accelerator time as a kernel-managed resource with cgroup integration.

// isle-core/src/power/budget.rs

/// Maximum number of power domains tracked by the power budgeting subsystem.
/// A typical datacenter server has:
///   - 1-2 CPU packages (CpuPackage)
///   - 8-128 CPU cores (CpuCore, if per-core RAPL is available)
///   - 1-2 DRAM controllers (Dram)
///   - 0-8 GPUs/accelerators (Accelerator)
///   - 1-4 NICs (Nic, if power-metered)
///   - 1-8 NVMe SSDs (Storage, if power-metered)
///   - 1 platform-level domain (Platform)
/// Setting 256 covers high-end servers with per-core monitoring enabled.
/// The ArrayVec avoids heap allocation on the tick hot path.
pub const MAX_POWER_DOMAINS: usize = 256;

/// Maximum number of cgroups tracked by the power budgeting subsystem.
/// Cgroups are typically hierarchical; a large server may have:
///   - 1 root cgroup
///   - 10-100 system.slice cgroups (systemd services)
///   - 10-1000 user.slice cgroups (user sessions, containers)
/// Setting 4096 covers large container hosts without excessive memory.
/// The FixedHashMap uses open addressing with power-of-two sizing.
pub const MAX_POWER_CGROUPS: usize = 4096;

/// Per-device power domain.
/// The kernel reads current power draw from hardware (RAPL, SCMI, ACPI).
pub struct PowerDomain {
    /// Domain identifier (matches device registry node).
    pub device_id: DeviceNodeId,

    /// Domain type.
    pub domain_type: PowerDomainType,

    /// Current power draw (milliwatts, updated every tick).
    pub current_mw: AtomicU32,

    /// Maximum power this domain can draw (TDP or configured limit).
    /// Initialized from ACPI PPCC (Participant Power Control Capabilities)
    /// tables where available; these define hardware power limits that
    /// the OS must respect. Falls back to TDP from CPUID/ACPI otherwise.
    pub max_mw: u32,

    /// Current performance level (0 = lowest power, 100 = maximum).
    pub perf_level: AtomicU32,

    /// Power measurement source.
    pub measurement: PowerMeasurement,
}

#[repr(u32)]
pub enum PowerDomainType {
    /// CPU package (includes all cores and uncore).
    CpuPackage  = 0,
    /// CPU core subset (per-core RAPL on Intel).
    CpuCore     = 1,
    /// DRAM (memory controller).
    Dram        = 2,
    /// GPU / accelerator.
    Accelerator = 3,
    /// NIC (if power-metered).
    Nic         = 4,
    /// NVMe SSD (if power-metered).
    Storage     = 5,
    /// Entire system (platform-level RAPL or BMC).
    Platform    = 6,
}

#[repr(u32)]
pub enum PowerMeasurement {
    /// Intel RAPL (Running Average Power Limit) via MSR.
    IntelRapl       = 0,
    /// AMD RAPL equivalent.
    AmdRapl         = 1,
    /// ARM SCMI (System Control and Management Interface).
    ArmScmi         = 2,
    /// ACPI Power Meter device.
    AcpiPowerMeter  = 3,
    /// BMC/IPMI (out-of-band, lower frequency).
    BmcIpmi         = 4,
    /// Estimated from utilization (no hardware meter).
    Estimated       = 5,
}

Per-architecture power measurement details:

Intel/AMD RAPL (x86):
  - Read via MSR: IA32_PKG_ENERGY_STATUS (package), IA32_PP0_ENERGY_STATUS (cores),
    IA32_DRAM_ENERGY_STATUS (DRAM), IA32_PP1_ENERGY_STATUS (GPU/uncore).
  - Resolution: ~15.3 μJ per LSB (Intel), ~15.6 μJ (AMD).
  - Read cost: ~100ns per MSR read. 6 domains × 100ns = 600ns per tick.
  - Per-core RAPL (Intel Raptor Lake+): per-CPU energy attribution. Enables
    precise per-cgroup power accounting without proportional estimation.
  - Overflow: 32-bit energy counters. Overflow interval depends on the CPU model's
    energy unit and current power draw — it MUST be computed at runtime. At boot,
    the kernel reads MSR_RAPL_POWER_UNIT to extract the energy unit (bits 12:8),
    giving energy_unit_joules = 2^(-ESU). For example, ESU=14 → ~61 μJ
    (Haswell and later, including Skylake); ESU=16 → ~15.3 μJ (Sandy Bridge /
    Ivy Bridge, the architecture default). The overflow interval is then:
      overflow_seconds = 2^32 * energy_unit_joules / current_power_watts
    For a 200W package with ESU=16: 2^32 * 15.3e-6 / 200 ≈ 329 seconds.
    For a 500W package with ESU=14: 2^32 * 61e-6 / 500 ≈ 524 seconds.
    The kernel sets the RAPL polling interval to min(overflow_seconds / 2, tick)
    to guarantee no counter wraparound is missed. With a 4ms tick and typical
    overflow intervals of 329-524 seconds, the formula always evaluates to `tick`
    (4ms) on current hardware — the overflow margin is enormous. The calculation
    is still performed at runtime (not assumed) to handle hypothetical hardware
    where very high power draw or very coarse energy units could produce an
    overflow interval shorter than twice the tick period.
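The runtime calculation above can be sketched as follows. This is an illustrative, self-contained model, not the ISLE kernel implementation: the raw MSR value is supplied by the caller and the MSR read itself is omitted.

```rust
// Illustrative sketch of the RAPL overflow-interval calculation.

/// Joules per LSB from MSR_RAPL_POWER_UNIT: energy unit = 2^(-ESU),
/// where ESU occupies bits 12:8 of the MSR value.
fn energy_unit_joules(msr_rapl_power_unit: u64) -> f64 {
    let esu = ((msr_rapl_power_unit >> 8) & 0x1f) as i32;
    2f64.powi(-esu)
}

/// overflow_seconds = 2^32 * energy_unit_joules / current_power_watts
fn overflow_seconds(energy_unit_j: f64, current_power_watts: f64) -> f64 {
    4_294_967_296.0 * energy_unit_j / current_power_watts
}

/// Polling interval = min(overflow_seconds / 2, tick), guaranteeing no
/// counter wraparound is missed even on hypothetical hardware with very
/// coarse energy units or very high draw.
fn rapl_poll_interval(energy_unit_j: f64, current_power_watts: f64, tick_s: f64) -> f64 {
    (overflow_seconds(energy_unit_j, current_power_watts) / 2.0).min(tick_s)
}
```

With ESU=16 the exact interval for a 200 W package comes out to 327.68 s (the text's 329 s uses the rounded 15.3 μJ figure), and the min() always selects the 4 ms tick.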

ARM SCMI (AArch64/ARMv7):
  - SCMI (System Control and Management Interface, ARM DEN 0056) is a standardized
    protocol for communication between the OS and a System Control Processor (SCP).
  - Power domains are discovered via SCMI_POWER_DOMAIN_ATTRIBUTES (protocol 0x11, message 0x03).
  - Power measurement: SCMI_SENSOR_READING_GET (protocol 0x15, message 0x06) reads sensor values
    from the SCP. Sensor types include POWER (watts), ENERGY (joules), CURRENT (amps).
  - Read cost: ~1-5 μs per SCMI message (shared memory + doorbell interrupt to SCP).
    Higher than RAPL (~100ns) but still within the 4ms tick budget.
  - Available on: ARM SBSA servers (AWS Graviton, Ampere), Cortex-M SCP-based
    platforms, and any SoC implementing SCMI power management.
  - Fallback: If SCMI is not available (e.g., simple embedded boards without SCP),
    fall back to Estimated mode.

  Power domain mapping:
    SCMI domain ID → ISLE PowerDomain:
      - SCMI_POWER_DOMAIN type "CPU" → PowerDomainType::CpuPackage or CpuCore
      - SCMI_POWER_DOMAIN type "GPU" → PowerDomainType::Accelerator
      - SCMI_POWER_DOMAIN type "MEM" → PowerDomainType::Dram
      - Platform-level SCMI sensor → PowerDomainType::Platform

RISC-V SBI PMU:
  - RISC-V has no standard power measurement interface. The SBI PMU extension
    (ratified) provides performance counters but not power counters.
  - On platforms with BMC/IPMI (e.g., datacenter RISC-V): use BmcIpmi source.
  - On platforms without any power measurement: use Estimated mode.
  - Future: the RISC-V power management task group is defining power management
    extensions. ISLE will adopt these when ratified.

Estimated (fallback for all architectures):
  - When no hardware power meter is available, ISLE estimates power from:
    a. CPU utilization × TDP per core (linear model, ~10% accuracy).
    b. Frequency scaling: power ∝ V² × f. Frequency from cpufreq.
    c. C-state residency: idle cores at deep C-states draw ~0.5-2W.
  - Estimation runs in the scheduler tick handler (zero additional overhead).
  - Accuracy: ±20-30% vs actual hardware measurement. Sufficient for coarse
    power budgeting (e.g., "keep this rack under 30kW") but not for fine-grained
    per-cgroup accounting.
  - The estimated source is logged at boot:
    isle: power: No hardware power meter detected, using estimated power model
    isle: power: Estimated power accuracy: ±25%. Consider RAPL/SCMI hardware.
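The estimation model (a)-(c) can be sketched as below. All constants — the per-core TDP share, the 1 W idle draw, and the base operating point — are illustrative assumptions for the example, not tuned ISLE values.

```rust
// Minimal sketch of the utilization-based fallback power model.

/// Idle core in a deep C-state draws ~0.5-2 W; use 1 W here (assumption).
const IDLE_CORE_MW: u32 = 1_000;

/// Linear utilization model scaled by the dynamic-power relation P ∝ V² × f.
/// `util` in [0.0, 1.0]; frequencies in MHz; voltages in volts.
fn estimate_core_mw(
    tdp_per_core_mw: u32,
    util: f64,
    freq_mhz: f64,
    base_freq_mhz: f64,
    volt: f64,
    base_volt: f64,
) -> u32 {
    if util <= 0.0 {
        return IDLE_CORE_MW; // (c) deep C-state residency floor
    }
    // (b) scale relative to the base operating point: V² × f.
    let vf_scale = (volt * volt * freq_mhz) / (base_volt * base_volt * base_freq_mhz);
    // (a) linear in utilization.
    (f64::from(tdp_per_core_mw) * util * vf_scale) as u32
}
```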

49.3 Cgroup Integration

/sys/fs/cgroup/<group>/power.max
    # Maximum total power budget for this cgroup (milliwatts).
    # Enforced across ALL power domains (CPU + GPU + memory + NIC).
    # Format: "150000" (150W)
    # "max" = no limit (default)

/sys/fs/cgroup/<group>/power.current
    # Current power draw by this cgroup (milliwatts, read-only).
    # Sum of all power domains attributed to this cgroup's processes.

/sys/fs/cgroup/<group>/power.stat
    # Power statistics:
    #   energy_uj <total energy consumed in microjoules>
    #   throttle_count <times power budget was exceeded and throttled>
    #   throttle_us <total microseconds spent throttled>
    #   avg_power_mw <average power over last 10 seconds>

/sys/fs/cgroup/<group>/power.weight
    # Relative share of excess power budget (like cpu.weight).
    # Default: 100. Higher = more power when contended.

/sys/fs/cgroup/<group>/power.domains
    # Per-domain power limits (optional, for fine-grained control).
    # Format: "cpu 80000 gpu 60000 dram 10000"
    # If not set, the global power.max is split by the kernel
    # based on workload demand.
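To make the power.domains format concrete, here is a hypothetical parser for the alternating name/milliwatt token stream. It illustrates the file format only; it is not the kernel's parser.

```rust
// Hypothetical parser for the power.domains format: "cpu 80000 gpu 60000 dram 10000".

fn parse_power_domains(s: &str) -> Option<Vec<(String, u32)>> {
    let tokens: Vec<&str> = s.split_whitespace().collect();
    if tokens.is_empty() || tokens.len() % 2 != 0 {
        return None; // every domain name must be followed by a milliwatt limit
    }
    tokens
        .chunks(2)
        .map(|pair| {
            pair[1]
                .parse::<u32>()
                .ok()
                .map(|mw| (pair[0].to_string(), mw))
        })
        .collect() // collects into Option<Vec<_>>: any bad token fails the whole parse
}
```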

49.4 Power-Aware Scheduler

// isle-core/src/sched/power.rs

/// Power budget enforcer.
/// Runs at scheduler tick frequency (~4ms). NOT per-scheduling-decision.
pub struct PowerBudgetEnforcer {
    /// Power domains on this machine.
    /// Populated at boot from ACPI/DT hardware discovery. The number of power
    /// domains is small and bounded (typically <=16: package + cores + DRAM +
    /// accelerators). Uses a fixed-capacity array sized to MAX_POWER_DOMAINS.
    domains: ArrayVec<PowerDomain, MAX_POWER_DOMAINS>,

    /// Per-cgroup power accounting.
    /// Uses a fixed-size hash table (open addressing, power-of-two size) rather
    /// than BTreeMap: this runs at tick frequency (~4ms) so O(1) average-case
    /// lookup is required. Maximum cgroup count is bounded by the cgroup hierarchy
    /// (typically <1024 active cgroups). Resized only on cgroup creation/deletion,
    /// never on the tick hot path.
    cgroup_power: FixedHashMap<CgroupId, CgroupPowerState, MAX_POWER_CGROUPS>,

    /// Global power budget (rack-level, from BMC or admin-configured).
    global_budget_mw: Option<u32>,
}

pub struct CgroupPowerState {
    /// Budget for this cgroup (from power.max).
    budget_mw: u32,

    /// Current attributed power draw.
    current_mw: u32,

    /// Running energy counter (microjoules).
    energy_uj: u64,

    /// Is this cgroup currently throttled?
    throttled: bool,

    /// Throttle mechanism:
    /// 1. Reduce CPU frequency (cpufreq) for this cgroup's cores.
    /// 2. Reduce accelerator clock (AccelBase set_performance_level).
    /// 3. As last resort: CPU throttling (delay scheduling).
    /// At most one action per power domain type (CPU freq, accel perf, CPU throttle),
    /// so bounded to MAX_POWER_DOMAINS.
    throttle_actions: ArrayVec<ThrottleAction, MAX_POWER_DOMAINS>,
}

pub enum ThrottleAction {
    /// Reduce CPU frequency to this level (MHz).
    CpuFrequency(u32),
    /// Reduce accelerator performance level.
    AccelPerformance { device_id: DeviceNodeId, level: u32 },
    /// Throttle CPU time (insert idle cycles).
    CpuThrottle { duty_cycle_percent: u32 },
}

Enforcement flow:

Every scheduler tick (~4ms):
  1. Read power counters from all domains (1 MSR read per domain, ~100ns each).
  2. Attribute power to cgroups based on CPU time + accelerator time share.
     Note: power attribution is an APPROXIMATION. RAPL gives package-level
     power, not per-process. Attribution model:
       Per-core RAPL (Intel, where available): precise per-CPU attribution.
       Package-level RAPL: proportional to (cgroup CPU time / total CPU time)
         weighted by frequency at time of execution.
       Accelerator: proportional to AccelBase get_utilization() per cgroup.
     This is the same limitation as Linux (perf energy-cores event).
     Exact per-process power metering requires hardware not yet available.
  3. For each cgroup:
     a. Is current_mw > budget_mw?
     b. Yes → apply throttle actions (reduce frequency, limit scheduling).
     c. No → release any active throttles.
  4. Total overhead: ~1 μs per tick. Fraction of the 4ms tick: 0.025%.
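Steps 2 and 3 of the flow can be sketched with simplified stand-in types. The attribution shown is the package-level proportional case; per-core RAPL and accelerator attribution follow the same shape with different inputs.

```rust
// Sketch of per-tick attribution (step 2) and the budget check (step 3).

/// Step 2, package-level RAPL: a cgroup's share of package power is
/// proportional to its share of CPU time (an approximation, as noted above).
fn attribute_mw(package_mw: u32, cgroup_cpu_ns: u64, total_cpu_ns: u64) -> u32 {
    if total_cpu_ns == 0 {
        return 0;
    }
    ((u64::from(package_mw) * cgroup_cpu_ns) / total_cpu_ns) as u32
}

#[derive(Debug, PartialEq)]
enum Enforcement {
    Throttle, // 3b: over budget — apply throttle actions
    Release,  // 3c: back under budget — release active throttles
    NoChange,
}

/// Step 3: compare attributed draw against the cgroup budget.
fn enforce(current_mw: u32, budget_mw: u32, currently_throttled: bool) -> Enforcement {
    if current_mw > budget_mw {
        Enforcement::Throttle
    } else if currently_throttled {
        Enforcement::Release
    } else {
        Enforcement::NoChange
    }
}
```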

49.5 System-Level Power Accounting

/sys/kernel/isle/power/
    energy_total_uj        # Total system energy since boot (microjoules)
    budget_mw              # System-wide power budget (admin-set)

Carbon policy is NOT the kernel's job. The kernel measures watts and enforces watt budgets. Carbon intensity depends on grid mix, geography, time of day, renewable contracts — all external to the machine. Orchestrators (Kubernetes, Nomad, custom fleet managers) can read energy_total_uj and compute carbon externally. This is the correct separation of concerns: kernel provides accurate power telemetry, userspace applies policy.

Thermal Throttling Coordination:

When the hardware thermal throttle engages, the power budget enforcer backs off to prevent double-throttling. Detection is architecture-specific:
  - x86: MSR IA32_THERM_STATUS (PROCHOT assertion) or ACPI thermal zone events.
  - AArch64/ARMv7: SCMI thermal notifications from the SCP, or ACPI thermal zones on SBSA-compliant servers. On DT-based platforms: thermal zone DT nodes with trip points.
  - RISC-V: Platform-specific (BMC/IPMI thermal events, or DT thermal zones).

If hardware is already throttling a domain, the kernel does not apply additional software throttling to that domain — doing so would reduce performance below what the thermal situation requires. The kernel logs the thermal event and adjusts its power model to account for reduced headroom.

Hardware/software throttle coordination: To prevent double-throttling during the detection window, the power enforcer reads the hardware throttle status BEFORE applying its own throttle. On x86, IA32_THERM_STATUS bit 0 (PROCHOT active) and IA32_PACKAGE_THERM_STATUS indicate active hardware throttling. On ARM, SCMI notifications deliver thermal events asynchronously. The coordination protocol:
  1. Before each enforcement tick, read hardware throttle status.
  2. If hardware throttling is active, skip software throttling for this tick (hardware is already reducing power draw).
  3. If hardware throttling was active on the previous tick but is now inactive, re-evaluate the software throttle based on the current power measurement (the RAPL or SCMI reading now reflects the hardware-throttled power level).
  4. Race window: between hardware engaging the thermal throttle and the next enforcement tick (~4 ms worst case), both throttles may be active simultaneously. This is safe — double-throttling reduces performance temporarily but does not cause correctness issues. The next tick detects the hardware throttle and removes the software throttle.
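The per-tick coordination decision can be sketched as a small state function. The hardware throttle status is assumed to be supplied by the caller; the MSR/SCMI read itself is omitted.

```rust
// Sketch of the hardware/software throttle coordination decision.

#[derive(Debug, PartialEq)]
enum CoordDecision {
    SkipSoftwareThrottle,  // hardware is already reducing power draw
    ReevaluateFromReading, // hardware throttle just cleared: re-check budget
    NormalEnforcement,
}

fn coordinate(hw_throttling_now: bool, hw_throttling_prev: bool) -> CoordDecision {
    match (hw_throttling_now, hw_throttling_prev) {
        (true, _) => CoordDecision::SkipSoftwareThrottle,
        (false, true) => CoordDecision::ReevaluateFromReading,
        (false, false) => CoordDecision::NormalEnforcement,
    }
}
```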

Battery Systems:

Power budgeting for battery-powered systems (laptops, edge devices) is out of scope for v1. Battery charge level, discharge rate, and remaining runtime are platform-management concerns handled by ACPI/UPower in userspace. The power budgeting system provides the watt-level telemetry that battery management software can consume, but does not implement battery-specific policies.

49.6 Performance Impact

Per-architecture overhead per scheduler tick (~4ms):

  Architecture        Read mechanism    Cost per domain  6-domain system  Overhead
  x86 (RAPL)          MSR read          ~100ns           600ns            0.015%
  AArch64 (SCMI)      SCP mailbox       ~1-5 μs          6-30 μs          0.15-0.75%
  ARMv7 (SCMI)        SCP mailbox       ~1-5 μs          6-30 μs          0.15-0.75%
  RISC-V (Estimated)  Calculation       ~50ns            300ns            0.008%
  PPC32 (Estimated)   Calculation       ~50ns            300ns            0.008%
  PPC64LE (OCC)       OPAL sensor read  ~1-5 μs          6-30 μs          0.15-0.75%
  Any (BMC/IPMI)      OOB polling       ~10-50 μs        60-300 μs        0.006-0.03% (rate-limited to 1/s)

SCMI overhead is higher than RAPL but still well within budget. For BMC/IPMI sources, the kernel rate-limits reads to 1 per second (not per tick) to avoid I2C/IPMI bus saturation, using the last-read value for inter-read ticks. The overhead percentage for BMC/IPMI reflects amortization over the 1-second read interval (60-300 μs / 1s), not per-tick cost.
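The BMC/IPMI rate limiting can be sketched as a cached meter: ticks between refreshes pay only a comparison plus a cached load. Names and the nanosecond-clock interface here are illustrative.

```rust
// Sketch of 1 Hz rate limiting for expensive out-of-band power reads.

struct RateLimitedMeter {
    last_read_ns: Option<u64>,
    cached_mw: u32,
}

impl RateLimitedMeter {
    const MIN_INTERVAL_NS: u64 = 1_000_000_000; // 1 second

    fn new() -> Self {
        RateLimitedMeter { last_read_ns: None, cached_mw: 0 }
    }

    /// `read_hw` performs the expensive out-of-band read (~10-50 μs).
    /// Between refreshes, the last-read value is returned.
    fn current_mw(&mut self, now_ns: u64, read_hw: impl FnOnce() -> u32) -> u32 {
        let due = match self.last_read_ns {
            None => true,
            Some(t) => now_ns.saturating_sub(t) >= Self::MIN_INTERVAL_NS,
        };
        if due {
            self.cached_mw = read_hw();
            self.last_read_ns = Some(now_ns);
        }
        self.cached_mw
    }
}
```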

When power throttling is active: performance reduction is intentional and configured. It replaces uncontrolled thermal throttling (which is worse — it's sudden and undifferentiated).

When power throttling is NOT active: zero overhead beyond the power reads.


50. Formal Verification Readiness

50.1 The Opportunity

Formal verification of kernel code crossed the practical threshold:

2009: seL4 — 200,000 lines of proof for 10,000 lines of C. Heroic effort.
2018: RustBelt — Formal soundness proof for Rust's ownership model.
2022-2025: Verus (Carnegie Mellon University, VMware Research, Microsoft Research,
  ETH Zurich, and others) — Automated verification for Rust.
  Write Rust code + specifications → tool PROVES correctness.
  Not testing. Not fuzzing. Mathematical machine-checked proof.

Verus can verify Rust code of realistic complexity: concurrent data structures, state machines, protocols, invariant maintenance. ISLE is written in Rust. The verification infrastructure exists.

50.2 What To Verify

Not everything needs verification. Focus on security-critical invariants and concurrency-sensitive code where bugs have catastrophic consequences:

  - Capability system (Section 11.1): Capabilities cannot be forged. Revocation is complete. Permissions never escalate.
  - Page table management (Section 12): No page is mapped into two processes simultaneously without explicit sharing. Freed pages are never accessible.
  - Memory allocator (Section 12): No page is allocated twice. No double-free. Buddy merging preserves free-list consistency. Allocation never returns memory outside tracked ranges.
  - KABI vtable dispatch (Section 5): Vtable calls never escape the driver's isolation domain. Version checks are correct.
  - IPC ring buffer (Section 8): The producer-consumer protocol never loses messages, never delivers duplicates, and never deadlocks.
  - CBS bandwidth server (Section 15.4): Bandwidth guarantees are met. No starvation.
  - DSM coherence protocol (Section 47.5): Multiple-reader / single-writer consistency is maintained. No lost writes.
  - Distributed capabilities (Section 47.9): Signature verification is correct. Revocation propagation is complete.
  - Power budget enforcement (Section 49): Budgets are never exceeded by more than one tick interval.

50.3 Design for Verifiability

Verification readiness is a design property, not a tool. Code must be structured so that specifications can be written and verified:

// Example: capability lookup with verification-ready specification.
// Verus-style annotations (compile-time only, erased from binary).

/// Lookup a capability by handle.
///
/// SPECIFICATION (verified by Verus):
///   requires: handle is valid for calling process
///   ensures:  returned capability matches the one in the capability table
///   ensures:  returned capability's generation <= object's current generation
///   ensures:  returned capability's permissions are a subset of the
///             delegator's permissions (no escalation)
pub fn cap_lookup(
    table: &CapabilityTable,
    process: ProcessId,
    handle: CapHandle,
) -> Result<Capability, CapError> {
    // Implementation must satisfy the specification.
    // Verus proves this at compile time.
    // No runtime overhead.
}

Design rules for verifiability:

  1. Explicit state: No hidden mutable global state. All state is in named structures with explicit ownership. (Rust already enforces this.)

  2. Small critical sections: Break complex operations into small, individually verifiable steps. Each step has a pre-condition and post-condition.

  3. Interface contracts: Every public function in security-critical modules has a documented specification (pre/post conditions, invariants). Verus verifies these.

  4. Algebraic data types for states: Use enums with exhaustive matching instead of integer flags. The type checker ensures all states are handled.

  5. Monotonic counters: Generation counters, version numbers — use types that enforce monotonicity (can only increase, never decrease).
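Rule 5 can be illustrated with a counter type whose API makes decrease unrepresentable. The Generation name is illustrative, not necessarily the ISLE type.

```rust
// Sketch of a monotonic generation counter: the only mutating operation
// advances it, so monotonicity holds by construction (and is trivial for
// a verifier to check).

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Generation(u64);

impl Generation {
    pub const ZERO: Generation = Generation(0);

    /// The only way to produce a new Generation from an old one: advance.
    /// No subtraction or setter exists.
    #[must_use]
    pub fn bump(self) -> Generation {
        Generation(self.0.checked_add(1).expect("generation overflow"))
    }

    pub fn value(self) -> u64 {
        self.0
    }
}
```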

50.4 Verification Tooling

Primary tool: Verus (Carnegie Mellon University, VMware Research, Microsoft Research, and others). Automated verification for Rust. Specification-driven proofs of functional correctness and memory safety properties.

Alternative tools (fallback if Verus hits scale limits):
  - Kani (Amazon): Bounded model checking for Rust. Explores all execution paths up to a configurable bound. Excellent for concurrent code and finding edge cases. Complementary to Verus — Kani finds bugs, Verus proves absence of bugs.
  - Prusti (ETH Zurich): Automated verification for Rust. Different proof strategy than Verus (separation logic vs SMT). Useful as a cross-check.

CI integration strategy:
  - Every commit: debug_assert! invariant checks + lightweight type-level assertions. Compile-time only; runs in seconds. Catches regressions in verified invariants.
  - Every PR: Kani bounded model checks on critical modules (~5-10 min). Catches concurrency bugs and edge cases.
  - Nightly: Full Verus specification proofs (~30-60 min for verified modules). Mathematical proof of correctness. Any proof failure blocks the next release.

Scope of verification — what is OUT of scope: Cross-component interactions (e.g., DSM coherence protocol interacting with hardware isolation boundaries simultaneously) are beyond current tool capabilities. Individual components are verified against their specifications; the composition is validated by integration testing and fuzzing. This is an honest limitation — complete whole-system verification remains a research problem.

Unsafe Code Verification Strategy:

Rust's unsafe blocks are the primary verification target — they are where memory safety invariants must be manually upheld. The strategy:

  1. Verus for ownership and invariant proofs: verify that unsafe code upholds the safety contract documented in its // SAFETY: comment. Verus can reason about pointer validity, aliasing, and lifetime guarantees.

  2. Kani for model-checking unsafe code paths: bounded model checking explores all possible inputs to unsafe functions up to a configurable bound, catching edge cases that specifications might miss.

  3. Wrap unsafe in safe abstractions: every unsafe block is encapsulated in a safe function with a verified specification. Callers never touch unsafe directly. The safe wrapper's specification becomes the verification boundary.
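Rule 3 can be illustrated with a minimal safe wrapper. The PageArray type and its Vec-backed storage are stand-ins for a raw kernel mapping.

```rust
// Sketch of wrapping unsafe in a safe abstraction: the bounds check is the
// documented safety contract, and the safe function's specification is the
// verification boundary.

pub struct PageArray {
    base: Vec<u8>, // stand-in for a raw kernel mapping
}

impl PageArray {
    /// Safe wrapper. Specification: returns Some(byte) iff index < len.
    /// Callers never touch `unsafe` directly.
    pub fn read_byte(&self, index: usize) -> Option<u8> {
        if index < self.base.len() {
            // SAFETY: index is bounds-checked above; `base` is a live,
            // initialized allocation owned by `self` for the whole call.
            Some(unsafe { *self.base.as_ptr().add(index) })
        } else {
            None
        }
    }
}
```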

Verification Complexity by Component:

Based on published Verus effort data and component characteristics:

  Component                           Relative Complexity  Rationale
  Capability system (Section 11)      Low                  Small state machine, clear invariants
  IPC ring buffer (Section 8)         Low                  Single producer-consumer, bounded
  Page table management (Section 12)  High                 Many edge cases, arch-specific
  CBS bandwidth server (Section 15)   Medium               Well-studied algorithm
  DSM coherence (Section 47.5)        High                 Distributed protocol, concurrent access

Page table management and DSM coherence are the hardest verification targets due to arch-specific code paths and distributed state. The capability system and IPC ring buffer are the easiest starting points for building verification expertise.

50.5 Performance Impact

Literally zero. Verification is compile-time. Verus specifications are erased from the binary. The verified code is identical to the unverified code at runtime.

The only cost is developer time writing specifications. But this pays for itself by eliminating bugs that would otherwise require debugging, CVE patches, and emergency releases.


51. Safe Kernel Extensibility

51.1 The Paradigm

The most important OS innovation of the last decade is eBPF: user-injected verified code in kernel hot paths. But eBPF is limited by being bolted onto a C kernel with a conservative bytecode verifier.

ISLE can generalize this: every kernel policy is a safe, hot-swappable module.

Distinction from eBPF (Section 20.4): eBPF provides Linux-compatible user-to-kernel hooks for tracing, networking, and security — it serves the Linux ecosystem. Policy modules provide kernel-internal mechanism/policy separation via KABI vtables — they serve kernel evolution. Both coexist; they address different extensibility needs.

Current KABI model (Section 5):
  Drivers implement KABI vtables for device interaction.
  Drivers are hot-swappable (crash recovery, Section 9).
  Drivers run in isolation domains.

Generalized KABI model (this proposal):
  POLICIES also implement KABI vtables.
  Policies are hot-swappable (same mechanism as drivers).
  Policies run in isolation domains (same as Tier 1 drivers).

  The kernel provides MECHANISMS (scheduling, page tables, memory allocation).
  POLICY MODULES provide DECISIONS (which process runs next,
  which page to evict, how to route I/O).

51.2 Extensible Policy Points

// isle-core/src/policy/mod.rs

/// Policy points where the kernel delegates decisions to a module.
/// Each policy point has a default built-in implementation.
/// Custom modules can replace the default at runtime.

/// CPU scheduling policy.
///
/// Policy modules receive a `SchedPolicyContext` snapshot (§51.2.1), NOT a direct
/// reference to the locked runqueue. The snapshot is captured by isle-core under
/// the runqueue lock before the domain switch, ensuring consistency without
/// exposing internal kernel data structures across the trust boundary.
pub trait SchedPolicy: Send + Sync {
    /// Pick the next task to run on this CPU.
    fn pick_next_task(&self, cpu: CpuId, ctx: &SchedPolicyContext) -> Option<TaskId>;
    /// A task has become runnable. Decide where to enqueue it.
    fn enqueue_task(&self, task: TaskId, flags: EnqueueFlags);
    /// A task has yielded or exhausted its timeslice.
    fn task_tick(&self, task: TaskId, cpu: CpuId);
    /// Load balancing decision: should we migrate tasks between CPUs?
    fn balance_load(&self, this_cpu: CpuId, busiest_cpu: CpuId) -> MigrateDecision;
}

/// Page replacement policy (which pages to evict under memory pressure).
pub trait PagePolicy: Send + Sync {
    /// Select pages to evict from this zone.
    /// Returns results via a caller-provided fixed-capacity buffer (ArrayVec)
    /// since nr_to_scan is bounded by the zone scan batch size. Policy modules
    /// must not heap-allocate on the eviction hot path.
    fn select_victims(&self, zone: &Zone, nr_to_scan: u32, out: &mut ArrayVec<PageHandle, MAX_SCAN_BATCH>);
    /// Should this page be promoted to a higher tier (active list, huge page)?
    fn should_promote(&self, page: &PageHandle) -> bool;
    /// Migration decision: should this page move to a different NUMA node?
    fn migration_advice(&self, page: &PageHandle, current_node: u8) -> MigrateAdvice;
}

/// I/O scheduling policy (ordering of block I/O requests).
pub trait IoSchedPolicy: Send + Sync {
    /// Submit a new I/O request. Return its priority score.
    fn submit(&self, req: &IoRequest) -> IoScore;
    /// Pick the next I/O request to dispatch to the device.
    fn dispatch(&self, queue: &IoQueue) -> Option<IoRequestId>;
    /// A request has completed. Update internal state.
    fn complete(&self, req: &IoRequest, latency_ns: u64);
}

/// Network classification policy (packet prioritization, QoS).
pub trait NetClassPolicy: Send + Sync {
    /// Classify an incoming packet (assign priority, mark, queue).
    fn classify_rx(&self, packet: &PacketHeader) -> NetClass;
    /// Classify an outgoing packet.
    fn classify_tx(&self, packet: &PacketHeader) -> NetClass;
}

/// Memory tiering policy (which tier to place pages in).
pub trait TierPolicy: Send + Sync {
    /// Where should a newly allocated page go?
    fn initial_placement(&self, process: ProcessId, flags: AllocFlags) -> TierId;
    /// A page has been idle for N ticks. Should it be demoted?
    fn demotion_advice(&self, page: &PageHandle, idle_ticks: u32) -> TierDecision;
    /// A remote node has available memory. Should we use it?
    fn remote_tier_advice(&self, node_id: NodeId, available_bytes: u64) -> bool;
}

51.2.1 Policy Module Trust Boundary

Memory access scope: When a policy module runs in its own isolation domain, the kernel maps into that domain (read-only):
  - Run queue metadata (task count, utilization, per-CPU load)
  - Per-task scheduling metadata (priority, PELT state, cgroup membership)
  - System-wide metrics (total CPU count, NUMA topology, frequency domains)

The module CANNOT access: process memory, page contents, file data, network buffers, capability tables, or other modules' state. A rogue pick_next_task cannot scan process memory — hardware domain isolation prevents it.

Locking model: The kernel calls policy module functions with no cross-domain locks held. Per-CPU scheduler state (the runqueue) is locked by the caller; the policy module receives a read-only snapshot of the runqueue state via the SchedPolicyContext argument, not direct access to the locked runqueue. This prevents TOCTOU races: the snapshot is consistent because it is captured under the runqueue lock before the domain switch. The module manages its own internal synchronization (spinlocks, per-CPU data, RCU-like patterns). If the module deadlocks internally, the domain watchdog (timer-based, ~10ms timeout) detects the stuck call and triggers crash recovery — revert to built-in default policy, reload module.

Stateful modules: Traits require Send + Sync, but modules need mutable state (counters, queues, learned parameters). The module owns its state and provides interior mutability via its own locks. The kernel does not hold locks on the module's behalf — the module is a self-contained unit.
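A minimal sketch of this pattern, using a trivial FIFO policy as a stand-in for a real scheduler module (types simplified; std locks standing in for kernel spinlocks):

```rust
// Sketch of a stateful, self-contained policy module: Send + Sync is
// satisfied via the module's own internal lock. The kernel never holds
// this lock on the module's behalf.

use std::collections::VecDeque;
use std::sync::Mutex;

type TaskId = u64;

struct FifoSchedPolicy {
    // Module-owned state behind the module's own lock (interior mutability).
    runnable: Mutex<VecDeque<TaskId>>,
}

impl FifoSchedPolicy {
    fn new() -> Self {
        FifoSchedPolicy { runnable: Mutex::new(VecDeque::new()) }
    }

    fn enqueue_task(&self, task: TaskId) {
        self.runnable.lock().unwrap().push_back(task);
    }

    fn pick_next_task(&self) -> Option<TaskId> {
        self.runnable.lock().unwrap().pop_front()
    }
}
```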

51.2.2 Side-Channel Mitigations

Domain isolation prevents direct memory reads across domain boundaries, but policy modules run in Ring 0 and share hardware resources with the core kernel. This opens side-channel vectors that domain isolation alone does not address.

Threat model: An untrusted or experimental module running in its own isolation domain could exploit:
  1. Shared-cache timing attacks (L1/L2/LLC) — measure cache line eviction timing to infer kernel memory access patterns.
  2. Speculative execution side-channels (Spectre v1 bounds check bypass) — trick the CPU into speculatively reading kernel data across the isolation domain boundary.
  3. Timing observation — use high-resolution timers (rdtsc, cycle counters) to measure the duration of kernel operations and infer internal state.

Mitigations:

  • Cache partitioning: Intel CAT (Cache Allocation Technology) / ARM MPAM (Memory System Resource Partitioning and Monitoring) partitions LLC ways so that an untrusted module's cache allocation does not overlap with the core kernel's. Configured per isolation domain at module load time. On architectures without hardware cache partitioning, cache flushing on domain transitions provides a weaker but functional defense.

  • Timer resolution reduction: On AArch64, clearing CNTKCTL_EL1.EL0PCTEN traps EL0 cycle counter reads, allowing the kernel to return a coarsened value. On x86, policy modules run in Ring 0, where rdtsc executes unconditionally regardless of CR4.TSD (the Intel SDM specifies that CR4.TSD=1 traps rdtsc only at CPL>0, not at CPL=0), so Ring 0 code has full rdtsc access. The side-channel mitigation for Ring 0 policy modules on x86 therefore relies on Intel CAT (LLC partitioning, described above) and cache flushing on domain transitions, not on timer coarsening — a deliberate acknowledgment that Ring 0 untrusted modules have the same timing access as any Ring 0 code. Recommendation: policy modules should use the kernel's monotonic clock abstraction (a ktime_get_ns() equivalent) rather than raw rdtsc / cycle counter reads, unless high-precision timing is explicitly required and the module is production-vetted (trusted). The kernel's time API provides sufficient resolution for scheduling and power decisions (~1ns on modern hardware) while keeping a single auditable timing interface. Untrusted modules that bypass the time API and read rdtsc directly can serve as timing oracles for side-channel attacks; code review should flag such usage.

  • Constant-time helpers: The kernel provides constant-time comparison and lookup functions for any data that crosses the domain boundary into module-readable memory. This prevents modules from using timing differences to distinguish data values.

  • Spectre v1 barriers: All kernel→module data handoff uses lfence (x86) / csdb (ARM) speculation barriers. Module-provided indices into kernel arrays are bounds-checked with an array_index_nospec equivalent (index masking) before use.
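The constant-time comparison and index-masking primitives described above can be sketched in safe Rust. This is an illustrative sketch, not the ISLE implementation: the names `ct_eq` and `index_nospec` are hypothetical, and the real kernel pairs the mask with the architecture-specific lfence/csdb barrier.

```rust
/// Constant-time byte-slice comparison: every byte is examined regardless of
/// where the first mismatch occurs, so timing reveals nothing about the data.
/// (`ct_eq` is an illustrative name, not the actual ISLE helper.)
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching on data
    }
    diff == 0
}

/// Branchless index masking, analogous to Linux's array_index_nospec:
/// returns `index` when `index < len`, else 0, with no conditional branch
/// the CPU could speculate past. Valid for index, len < isize::MAX, which
/// any kernel array satisfies.
fn index_nospec(index: usize, len: usize) -> usize {
    // index < len  => subtraction wraps negative => sign bit set => mask = all ones
    // index >= len => small non-negative difference => mask = 0
    let mask = (index.wrapping_sub(len) as isize >> (usize::BITS - 1)) as usize;
    index & mask
}
```

An out-of-bounds index collapses to 0 even under misspeculation, so the subsequent array access can never reveal out-of-bounds kernel data.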

Residual risk: Production-vetted modules (signed, running in the Core isolation domain) face the same side-channel exposure as any Ring 0 code — this is acceptable since they are fully trusted. Side-channel mitigations apply only to untrusted/experimental modules running in isolation domains. This is a deliberate trade-off: production modules get zero overhead, experimental modules get strong isolation at a small performance cost.

51.3 Module Lifecycle

Policy module lifecycle (same as driver lifecycle, Section 9):

1. Module binary is compiled Rust (same toolchain as kernel).
   Implements one or more policy traits via KABI vtable.
   Signed with driver signature mechanism (Section 22.4).
   Vtable uses same versioning as driver KABI: vtable_size field +
   InterfaceVersion check. A kernel upgrade that adds new methods to
   SchedPolicy extends the vtable — old modules still work (new methods
   fall back to built-in defaults based on vtable_size).

2. Module is loaded at runtime:
   echo "sched_ml_aware" > /sys/kernel/isle/policy/scheduler/active

3. Kernel:
   a. Verifies module signature.
   a2. Extends TPM PCR (or TDX RTMR) with module hash. For confidential
       computing attestation (§26), the loaded policy module is part of the
       Trusted Computing Base and must be measured.
   b. Allocates isolation domain for the module (if untrusted/experimental).
      Production-vetted modules (signed by kernel vendor, pre-verified)
      run in the Core isolation domain — zero domain transition overhead.
   c. Loads module code into isolated memory region.
   d. KABI vtable exchange (module provides policy vtable).
   e. Atomically swaps old policy for new policy.
   f. Old policy module can be unloaded.

4. Module crash:
   a. Domain fault trapped by kernel.
   b. Revert to built-in default policy (immediate, no interruption).
   c. Reload module if desired.
   d. Total disruption: zero. Built-in default handles the gap.

5. Module hot-swap:
   echo "sched_cfs_isle" > /sys/kernel/isle/policy/scheduler/active
   → Atomic swap to new policy. No interruption.
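The vtable_size fallback from step 1 can be sketched as follows. All names here (`SchedVtable`, `dispatch_select_victims`) are hypothetical; a slot the module was compiled without is modeled as `None`, which the dispatcher resolves to the built-in default.

```rust
/// Sketch of vtable growth with size-based fallback. The real KABI check
/// compares `vtable_size` + InterfaceVersion; here an absent slot is `None`.
struct SchedVtable {
    /// Vtable size the module was compiled against; slots beyond it are absent.
    vtable_size: usize,
    /// Present since v1 of the interface.
    pick_next_task: fn() -> u32,
    /// Added by a later kernel; older modules don't provide it.
    select_victims: Option<fn() -> u32>,
}

fn builtin_select_victims() -> u32 {
    0 // built-in default policy
}

/// Dispatch that tolerates old modules: missing slots fall back to defaults.
fn dispatch_select_victims(vt: &SchedVtable) -> u32 {
    match vt.select_victims {
        Some(f) => f(),                   // new module implements the slot
        None => builtin_select_victims(), // old module: built-in default
    }
}
```

This is why adding methods to SchedPolicy never breaks old modules: new slots are optional by construction.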

51.4 Relationship to eBPF

eBPF compatibility is maintained through isle-compat. Existing eBPF programs (XDP, tc, kprobes, tracepoints) work via the BPF syscall. Policy modules are a superset — they can do everything eBPF can do plus:

  • Full Rust expressiveness (loops, recursion, complex data structures)
  • Persistent mutable state (eBPF maps are limited)
  • Domain isolation instead of bytecode verifier (more flexible, same safety)
  • Crash recovery (eBPF programs can't crash; policy modules can, and are reloaded)

|                   | eBPF (Linux compat)    | Policy Modules (ISLE)               |
|-------------------|------------------------|-------------------------------------|
| Safety mechanism  | Bytecode verifier      | Rust type system + domain isolation |
| Language          | BPF bytecode (limited) | Rust (full language)                |
| State             | BPF maps (key-value)   | Any Rust data structure             |
| Crash behavior    | Cannot crash           | Crash → reload, default resumes     |
| Hot-swap          | Per-program            | Per-policy-point                    |
| Integration depth | Hook points only       | Full vtable interface               |

51.5 Linux Compatibility

sched_ext (Linux 6.12+) allows user-defined BPF scheduling policies. ISLE supports this through isle-compat:

  • sched_ext BPF programs load via the standard bpf() syscall
  • They run in the BPF compatibility layer
  • Performance and behavior identical to Linux sched_ext

Policy modules are an additional, ISLE-specific mechanism. Applications unaware of them see standard scheduling behavior.

Module Observability:

Policy modules emit structured tracepoints for every decision:

  • isle_tp_stable_policy_decision: emitted on each pick_next_task, select_victims, dispatch call. Fields: module name, decision type, chosen entity, alternatives considered, decision latency.
  • isle_tp_stable_policy_audit: decision audit log for compliance. Records which module made which resource allocation decision, enabling post-hoc analysis.
  • A/B comparison mode: two policy modules can run simultaneously — one active (making real decisions) and one shadow (receiving the same inputs, logging what it would have decided). Compare via policy.comparison_log in sysfs. This enables safe evaluation of new policies before activation.

51.6 Performance Impact

Indirect function call via vtable pointer: ~1-2ns (branch predictor handles it). Linux already uses the same pattern (sched_class->pick_next_task is a function pointer). Same cost as Linux.

Default (production-vetted modules): modules signed by the kernel vendor and pre-verified run in the Core isolation domain. Zero domain transition overhead. Same cost as Linux sched_class function pointer dispatch.

Untrusted/experimental modules: run in their own isolation domain. Each policy call adds one domain-register write (WRPKRU on x86, POR_EL0+ISB on AArch64, DACR on ARMv7) at ~23 cycles (per Section 3), and each call crosses the domain boundary twice (enter + exit), costing 2 × 23 = 46 cycles. For scheduling, the policy is invoked once per context switch (~200 cycles); adding 46 cycles to 200 is ~23% overhead on the context-switch micro-path. This is the cost of sandbox isolation for unvetted code, and it is acceptable for development and experimentation. The module graduates to the Core isolation domain after vetting.

(Note: WRPKRU latency varies by microarchitecture — measured at 11 cycles on Alder Lake, 23 cycles on Skylake, and up to 260 cycles on some Atom cores. The 23-cycle figure used throughout this section reflects Skylake-class server parts; overhead on other microarchitectures scales proportionally. The worst case (Atom, 260 cycles) would increase the domain-transition overhead by ~11x, but Atom-class cores are not a primary ISLE server target.)


52. Live Kernel Evolution

52.1 The Theseus Model

Theseus OS (Rice University, 2020) demonstrated that kernel components can be individually replaced at runtime without rebooting, by making state ownership explicit and granular.

ISLE already does this for drivers (Section 9 crash recovery). This section extends it to core kernel components.

52.2 Design: Explicit State Ownership Graph

// isle-core/src/evolution/mod.rs

/// Every kernel component declares its state explicitly.
/// This enables:
///   1. Live replacement: old component's state is migrated to new component.
///   2. Crash recovery: component's state can be reconstructed from invariants.
///   3. State inspection: debugging and observability.

/// Trait that every replaceable kernel component implements.
pub trait EvolvableComponent {
    /// Component's serializable state.
    /// Must capture ALL mutable state that persists across calls.
    type State: Serialize + Deserialize;

    /// Export current state for migration to a new version.
    fn export_state(&self) -> Self::State;

    /// Initialize from migrated state (for live replacement).
    fn import_state(state: Self::State) -> Result<Self, MigrationError>
    where Self: Sized;

    /// Initialize fresh (for first boot or after state loss).
    fn initialize_fresh(config: &KernelConfig) -> Self
    where Self: Sized;

    /// Version of this component's state format.
    /// Migration rule: v(N) can import v(N-1) state ONLY.
    /// For larger jumps (v1 → v5): chained migration through intermediates
    /// (v1 → v2 → v3 → v4 → v5). Each version carries ONE migration
    /// function from the immediately prior version. The chain runs
    /// during import_state() before the atomic swap.
    fn state_version(&self) -> u32;
}

Chain length bound: To prevent unbounded migration chains, the maximum chain length is 8 intermediate versions. A component at version v(K) can be live-evolved to at most version v(K+8) in a single operation. Larger version jumps require: (a) A direct v(K)→v(K+N) migration function registered by the new component (the component author provides a migration path that skips intermediates), or (b) Multiple sequential live evolutions (v(K)→v(K+8)→v(K+16)→...), each of which is a separate atomic operation with its own rollback capability. The 8-version limit bounds the worst-case migration time to ~8× the single-step migration cost. If a chained migration exceeds 500 ms total elapsed time, the evolution is aborted and the old component continues running. This timeout is configurable via evolution.max_chain_time_ms.
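The bounded chain can be sketched as a loop over single-step migrations. This is a minimal sketch under stated assumptions: `migrate_chain` and `MigrationError::ChainTooLong` are hypothetical names, the per-version migration function is passed in as a closure, and the 500 ms elapsed-time abort is omitted for brevity.

```rust
const MAX_CHAIN: u32 = 8; // at most 8 intermediate versions per evolution

#[derive(Debug, PartialEq)]
enum MigrationError {
    ChainTooLong,
}

/// Run the migration chain v(from) -> v(from+1) -> ... -> v(to).
/// `step` is the single registered migration function that upgrades a
/// state blob from version `v` to `v + 1`, mirroring the rule that each
/// version carries ONE migration function from its immediate predecessor.
fn migrate_chain(
    mut state: Vec<u8>,
    from: u32,
    to: u32,
    step: impl Fn(Vec<u8>, u32) -> Vec<u8>,
) -> Result<Vec<u8>, MigrationError> {
    assert!(to >= from, "live evolution only upgrades");
    // Enforce the chain-length bound before doing any work.
    if to - from > MAX_CHAIN {
        return Err(MigrationError::ChainTooLong);
    }
    for v in from..to {
        state = step(state, v); // one version per step
    }
    Ok(state)
}
```

A v(K)→v(K+9) request fails fast with `ChainTooLong`, forcing the caller to either register a direct migration or perform two sequential evolutions, exactly as the text above prescribes.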

State Serialization Format:

/// Serialized component state for live replacement.
pub struct ComponentState {
    /// Component identifier (e.g., "scheduler", "page_replacement").
    pub component_id: &'static str,
    /// State format version (matches EvolvableComponent::state_version).
    pub version: u32,
    /// Serialized state data (component-owned schema).
    /// Allocated from the kernel heap via `alloc::vec::Vec` — this is acceptable
    /// because state export/import runs only during live replacement (rare, cold
    /// path, well after the heap allocator is initialized). State sizes are
    /// bounded per component (see §52.4 table).
    pub data: Vec<u8>,
    /// CRC32C of all preceding fields, using hardware acceleration
    /// (SSE4.2 `crc32` on x86, ARMv8 CRC instructions).
    ///
    /// **Checksum**: CRC32C provides adequate 32-bit error detection for this
    /// small, cold-path structure. A cryptographic hash is unnecessary here —
    /// state integrity against malicious tampering is enforced by the evolution
    /// framework's capability checks and signature verification (Section 52.4),
    /// not by this checksum.
    pub checksum: u32,
}

Each component owns its serialization schema. The kernel provides StateSerializer helpers for common patterns (serialize BTreeMap, serialize per-CPU arrays, serialize LRU lists) but does not impose a format. Components choose what to serialize and how — the contract is that import_state(export_state()) produces an equivalent component.
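The round-trip contract can be illustrated with a toy component. `LruCachePolicy` is a made-up example, not an ISLE component, and the associated `State` type is simplified to a tuple rather than a serialized `ComponentState` blob; the point is only that `import_state(export_state())` must reproduce an equivalent component.

```rust
/// Toy component illustrating the round-trip contract.
#[derive(Debug, Clone, PartialEq)]
struct LruCachePolicy {
    // ALL mutable state that persists across calls, per the trait contract.
    recency: Vec<u64>, // page IDs, most-recently-used first
    hits: u64,
}

impl LruCachePolicy {
    /// Export current state; a real component would serialize this into
    /// ComponentState::data using its own schema.
    fn export_state(&self) -> (Vec<u64>, u64) {
        (self.recency.clone(), self.hits)
    }

    /// Rebuild the component from exported state (live replacement path).
    fn import_state(state: (Vec<u64>, u64)) -> Self {
        LruCachePolicy { recency: state.0, hits: state.1 }
    }
}
```

If the re-imported component is not observably equivalent to the original, the component violates the contract and is not safe to live-replace.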

52.3 Component Replacement Flow

Live kernel component replacement (e.g., new scheduler algorithm):

Phase A — Preparation (runs concurrently with normal operation, NOT stop-the-world):
  1. New component binary loaded (same mechanism as policy module, §51).
  2. Old component: export_state() → serialized state.
     This may walk large data structures (all run queues, LRU lists, etc.).
     Time: potentially milliseconds for complex components.
     Normal operation continues during this phase — the old component
     is still active and handling requests.
  3. New component: import_state(serialized_state) → initialized.

Phase A' — Quiescence (bounded, runs before the atomic swap):
  Before the atomic swap, the old component enters a **quiescence phase**: all
  in-flight operations are allowed to complete (with a bounded deadline), and new
  operations are queued. The quiescence deadline is configurable per component type
  (default: 10ms for scheduler, 50ms for page replacement). If the deadline expires
  before all in-flight operations drain, the replacement is aborted and the old
  component resumes normal operation without disruption.

  **Operation interception mechanism**: At Phase A' entry, a per-component
  `quiescing: AtomicBool` flag is set to `true`. The vtable entry trampoline checks
  this flag before dispatching each call. When `quiescing` is `true`, the trampoline
  appends the operation descriptor (a serialized `PendingOp` containing the method ID
  and argument blob) to a bounded `pending_ops` queue instead of invoking the old
  component. This interception is lock-free (the queue is a pre-allocated MPSC ring
  buffer). The vtable pointer itself is not yet swapped — interception happens at the
  trampoline level, not the pointer level.

  **Queued operation handling**: Operations that arrive during Phase A' are appended
  to `pending_ops` via the interception mechanism above. If `pending_ops` reaches
  capacity (default: 1024 entries), the quiescence deadline is extended by up to
  100ms. If the deadline expires and in-flight operations have still not drained,
  the evolution is aborted: `quiescing` is set to `false`, the trampoline resumes
  normal dispatch, and the old component resumes without disruption.

  **State re-export**: After in-flight operations drain, the old component's state
  is re-exported (`export_state()` on the now-quiesced component). This re-export
  does NOT capture `pending_ops` — the queue is transferred separately in Phase B.

Phase B — Atomic swap (stop-the-world, ~1-10 μs):
  4. All CPUs briefly hold (IPI to stop-the-world).
  5. The `pending_ops` queue pointer is transferred to the new component by
     copying the ring buffer head/tail pointers. This is O(1) — no data copying,
     just pointer assignment. Operations that arrived between the Phase A'
     re-export and the IPI are captured because the interception trampoline
     continues appending to `pending_ops` until the IPI fires.
  6. Old component's vtable pointer replaced with new component's vtable.
  7. Interrupt handlers redirected. `quiescing` flag cleared.
  8. CPUs released. New component is now active.
  Only the pointer swap + queue transfer is stop-the-world. No data structure
  walking. The queue transfer (step 5) adds ~100ns to the stop-the-world window.

Phase C — Activation and cleanup:
  Phase C1 — New component activation:
    9. New component drains `pending_ops` queue before accepting new operations.
       Each pending op is replayed through the new component's vtable.
  Phase C2 — Deferred cleanup (after watchdog window):
    10. The old component is NOT immediately freed. It is frozen (no new calls)
        but its memory is retained for the Post-Swap Watchdog window (5 seconds,
        see below). If the watchdog triggers a revert, the old component is
        reactivated from this frozen state.
    11. After the watchdog window expires without revert, the old component is
        unloaded and its memory freed.
  Total disruption: ~1-10 μs (the Phase B stop-the-world window only).

If import_state fails (incompatible version):
  → Abort replacement. Old component continues. No disruption.

If new component crashes after replacement:
  → Crash recovery (Section 9). Reload old component with initialize_fresh().
  → Component state lost, but system continues.
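The Phase A' interception trampoline can be sketched as a flag check in front of the vtable dispatch. Names here (`Trampoline`, `PendingOp`) are hypothetical, and a `Mutex<Vec>` stands in for the pre-allocated lock-free MPSC ring buffer the text describes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Serialized operation descriptor queued during quiescence.
struct PendingOp {
    method_id: u32,
    args: Vec<u8>,
}

/// Sketch of the per-component trampoline: when `quiescing` is set, calls
/// are queued for replay by the new component instead of being dispatched.
struct Trampoline {
    quiescing: AtomicBool,
    pending_ops: Mutex<Vec<PendingOp>>, // stand-in for the MPSC ring buffer
}

impl Trampoline {
    /// Returns Some(result) on normal dispatch, None when the op was queued.
    fn dispatch(&self, op: PendingOp, call: impl Fn(&PendingOp) -> u64) -> Option<u64> {
        if self.quiescing.load(Ordering::Acquire) {
            // Phase A': append to pending_ops instead of invoking the old component.
            self.pending_ops.lock().unwrap().push(op);
            None
        } else {
            // Normal path: invoke the (old) component through its vtable.
            Some(call(&op))
        }
    }
}
```

Note that the vtable pointer itself is untouched here: interception happens at the trampoline level, and the pointer swap occurs only in the Phase B stop-the-world window.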

Post-Swap Watchdog:

After the atomic swap (Phase B), a 5-second watchdog timer starts. If the new component crashes or triggers a fault within this window, the kernel reverts to the old component using the RETAINED serialized state (from export_state() in Phase A), not initialize_fresh(). This preserves accumulated state (run queue weights, LRU ordering, learned parameters) across a failed swap attempt. Only if the retained state itself is corrupted does the kernel fall back to initialize_fresh().

Memory During Swap:

The dual-load approach (old + new component coexist during Phase A) requires sufficient memory for both. Typical component state sizes: scheduler ~64KB, page replacement ~128KB, I/O scheduler ~8KB per device. If insufficient memory is available for the new component's state, the swap returns ENOMEM and the old component continues unchanged. Maximum expected dual-load overhead: ~128KB for the scheduler (the largest replaceable component).

52.3.1 Export Symbol Contract

When a component is live-replaced, other components may depend on its exported symbols (vtable entries, public functions, constants). The following rules govern export compatibility during live evolution:

  1. Compatible exports required. The new version MUST export the same KABI vtable entries at compatible types (same layout, same semantics). If the new version changes an export's signature (different parameter types, different return type, different struct layout), the live evolution is rejected at load time during Phase A. The loader compares vtable sizes and entry signatures before proceeding to state export.

  2. Indirection-based resolution. Export addresses are resolved through the KABI vtable indirection table, not direct pointers. When the new version loads, the vtable pointer is atomically updated during Phase B (step 5). Dependent components never hold raw function pointers to the old version's code -- they dispatch through the vtable pointer, which is updated in the stop-the-world window. This is the same mechanism used for policy module vtable dispatch (Section 51).

  3. Removed exports rejected. If the new version removes a vtable entry (reduces vtable_size), the evolution is rejected unless no loaded component references the removed entry. The loader scans the dependency graph during Phase A to verify this. Adding new entries (increasing vtable_size) is always safe -- existing callers never reference entries beyond the size they were compiled against.

52.4 What Can Be Live-Replaced

| Component          | Replaceable?   | State Size                                      | Notes                                                    |
|--------------------|----------------|-------------------------------------------------|----------------------------------------------------------|
| CPU scheduler      | Yes            | Per-CPU run queues, CBS servers (~64KB total)   | Policy module swap (§51) covers most cases               |
| Page replacement   | Yes            | LRU lists, access counters (~128KB)             | Hot-swap eviction algorithm                              |
| I/O scheduler      | Yes            | Per-device queues (~8KB per device)             | Hot-swap I/O algorithm                                   |
| Network classifier | Yes            | Classification rules, flow tables (~256KB)      | Hot-swap QoS policy                                      |
| Memory allocator   | No             | Buddy allocator state is the physical memory map | Too fundamental to swap. Bugs caught by verification (§50). |
| Page table manager | No             | Active page tables for all processes            | Same — too fundamental.                                  |
| Capability system  | No             | Global capability table                         | Security-critical — verified (§50), never replaced.      |
| KABI dispatch      | No             | Vtable registry                                 | Infrastructure — stable by design.                       |
| Tier 1 drivers     | Yes (existing) | Driver-internal state                           | Crash recovery already handles this (Section 9).         |

The non-replaceable components (listed above) are verified via the techniques in Section 50. The replaceable components can be evolved independently via Section 52. Together: verified core + evolvable policy.

52.5 Performance Impact

Steady-state: zero. Between replacements, code paths are identical to a monolithic kernel. The EvolvableComponent trait adds no runtime code — it's a development contract.

During replacement: ~1-10 μs stop-the-world. Happens at most once per kernel update. Amortized over months of uptime: unmeasurable.


53. Intent-Based Resource Management

53.1 The Abstraction Gap

ISLE has all the mechanisms for smart resource management:

  • In-kernel inference engine (Section 45) for learned decisions
  • Per-device utilization tracking (Section 42)
  • Topology awareness (device registry, Section 7)
  • Power metering (Section 49)
  • Memory tier tracking (PageLocationTracker, Section 43)
  • Network fabric topology (Section 47.2)

What's missing is the abstraction that ties these together. Currently, resources are managed imperatively: "give me 4 cores and 16GB RAM." The alternative: declare goals, let the kernel optimize.

53.2 Design: Resource Intents

// isle-core/src/intent/mod.rs

/// A resource intent declares WHAT the workload needs,
/// not HOW to allocate resources.
#[repr(C)]
pub struct ResourceIntent {
    /// Target P99 SCHEDULING latency (nanoseconds).
    /// This is the time from task becoming runnable to task getting CPU.
    /// The kernel cannot measure application-level latency (it doesn't know
    /// what an "operation" is). This metric is scheduling + I/O completion
    /// latency — both kernel-observable.
    /// Kernel adjusts CPU priority, memory placement, I/O scheduling.
    /// 0 = no latency target (best-effort).
    pub target_latency_ns: u64,

    /// Target throughput (operations per second).
    /// Kernel adjusts CPU allocation, I/O queue depth, batch sizes.
    /// 0 = no throughput target (best-effort).
    pub target_ops_per_sec: u64,

    /// Availability requirement (basis points: 9999 = 99.99%).
    /// Kernel adjusts redundancy, crash recovery priority.
    /// 0 = no availability target.
    /// Used by: cgroup knob `intent.availability` (Section 53.3),
    /// crash recovery priority in Section 39 (higher availability_bp
    /// = faster restart, more aggressive health monitoring).
    pub availability_bp: u32,

    /// Power efficiency preference (0 = max performance, 100 = max efficiency).
    /// Kernel adjusts DVFS, core parking, accelerator clock.
    /// 50 = balanced (default).
    pub efficiency_preference: u32,

    /// Data locality hint: where does this workload's data live?
    /// Kernel uses this for NUMA placement and distributed scheduling.
    /// Used by: cgroup knob `intent.data_affinity` (Section 53.3),
    /// NUMA placement optimizer (Section 53.4 step 2b), and distributed
    /// scheduling in Section 47.
    pub data_affinity: DataAffinityHint,

    /// Struct layout version. Enables future extension without breaking binary
    /// compatibility: the kernel checks this field and interprets fields beyond
    /// the base layout only if version >= the version that introduced them.
    /// v1 = initial layout (this definition). Future versions extend into _reserved.
    pub version: u32,

    /// Reserved for future fields. New versions of ResourceIntent consume bytes
    /// from this region. Zero-initialized by callers; the kernel ignores
    /// non-zero bytes in positions it does not recognize for the given version.
    /// Sized to make the struct exactly 64 bytes (u64-aligned) with no implicit
    /// tail padding: 8+8+4+4+4+4+32 = 64.
    pub _reserved: [u8; 32],
}

#[repr(u32)]
pub enum DataAffinityHint {
    /// No preference. Kernel decides based on observation.
    Auto            = 0,
    /// Data is primarily local (disk-bound workload).
    Local           = 1,
    /// Data is distributed across nodes (distributed workload).
    Distributed     = 2,
    /// Data is on accelerators (GPU-bound workload).
    Accelerator     = 3,
}
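The claimed 64-byte, padding-free layout is checkable at compile time. The following sketch reproduces the two definitions above in abbreviated form (doc comments elided) and adds a hypothetical compile-time size assertion of the kind the kernel build could carry:

```rust
#[repr(u32)]
#[derive(Clone, Copy)]
enum DataAffinityHint {
    Auto = 0,
    Local = 1,
    Distributed = 2,
    Accelerator = 3,
}

#[repr(C)]
struct ResourceIntent {
    target_latency_ns: u64,
    target_ops_per_sec: u64,
    availability_bp: u32,
    efficiency_preference: u32,
    data_affinity: DataAffinityHint, // #[repr(u32)] => 4 bytes
    version: u32,
    _reserved: [u8; 32],
}

// 8 + 8 + 4 + 4 + 4 + 4 + 32 = 64 bytes, u64-aligned, no implicit tail
// padding. Evaluated at compile time: a layout regression fails the build.
const _: () = assert!(std::mem::size_of::<ResourceIntent>() == 64);
```

Pinning the size this way is what makes the `_reserved`/`version` extension scheme safe: future fields consume reserved bytes without ever changing the struct's size or alignment.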

53.3 Cgroup Integration

/sys/fs/cgroup/<group>/intent.latency_ns
    # Target P99 latency in nanoseconds.
    # "0" = no target (default, pure imperative mode).
    # "5000000" = target 5ms P99 latency.

/sys/fs/cgroup/<group>/intent.throughput
    # Target operations per second.
    # "0" = no target.

/sys/fs/cgroup/<group>/intent.efficiency
    # 0 = max performance, 100 = max efficiency, 50 = balanced.
    # Default: 50.

/sys/fs/cgroup/<group>/intent.availability
    # Availability target in basis points (0 = no target, 9999 = 99.99%).
    # Kernel adjusts crash recovery priority and health monitoring
    # frequency for drivers serving this cgroup. Higher values trigger
    # faster driver restart (Section 39) and redundant I/O path selection.
    # Default: 0 (no availability target).
    # Maps to: ResourceIntent.availability_bp

/sys/fs/cgroup/<group>/intent.data_affinity
    # Data locality hint for NUMA placement and distributed scheduling.
    # Values: "auto" (default), "local", "distributed", "accelerator"
    # "auto" = kernel observes memory access patterns and decides.
    # "local" = data is primarily on local storage (optimize for disk I/O).
    # "distributed" = data spans cluster nodes (optimize for network).
    # "accelerator" = data lives on accelerator memory (minimize transfers).
    # Maps to: ResourceIntent.data_affinity (DataAffinityHint enum)

/sys/fs/cgroup/<group>/intent.status
    # Read-only. Current intent satisfaction:
    #   latency_met: true
    #   latency_p99_actual_ns: 3200000
    #   throughput_met: true
    #   throughput_actual: 12500
    #   power_actual_mw: 95000
    #   adjustments_last_hour: 3

53.4 The Optimization Loop

Every ~1 second (configurable):

1. IntentOptimizer collects metrics:
   - Per-cgroup: actual latency (P50, P99), throughput, power
   - Per-device: utilization, temperature, power
   - Cluster-wide: node loads, memory pressure, network utilization

2. For each cgroup with intents:
   a. Is the intent being met?
      - latency_p99_actual <= target_latency_ns?
      - throughput_actual >= target_ops_per_sec?
   b. If not met → need more resources:
      - Increase CPU allocation (raise cpu.weight or cpu.guarantee)
      - Improve NUMA placement (migrate pages closer to running CPUs)
      - Increase accelerator allocation (raise accel.compute.guarantee)
      - Increase I/O priority
   c. If met with headroom → can release resources:
      - Reduce CPU allocation (lower cpu.weight)
      - Lower frequency (save power)
      - Free accelerator time for other workloads

3. Apply adjustments via existing cgroup knobs.
   Intent layer is an OPTIMIZER that writes to existing imperative knobs.
   It does NOT replace the imperative interface — it sits above it.

   Stability controls (prevent oscillation/hunting):
     - Hysteresis: don't adjust unless delta exceeds 10% of current value.
     - Minimum hold time: no changes within 5 seconds of last adjustment.
     - Damping: exponential backoff if last 3 adjustments didn't converge.
     - Max adjustment rate: at most ±20% change per optimization cycle.

4. Conflicting intents: when multiple cgroups declare intents that cannot
   all be satisfied simultaneously (insufficient resources):
     - Intents are BEST-EFFORT, not guarantees.
     - Priority follows existing cpu.weight / accel.compute.weight hierarchy.
     - Higher-weight cgroups get intent satisfaction first.
     - Unsatisfied intents are reported in intent.status (latency_met: false).
     - The optimizer does NOT starve low-priority cgroups — it respects
       existing cgroup min guarantees (cpu.min, memory.min).

5. Policy priority ordering (prevents conflicts between subsystems):
     Priority (highest to lowest):
       a. Hardware limits (thermal throttle, voltage limits) — immutable
       b. Admin-configured cgroup limits (cpu.max, power.max) — hard ceiling
       c. Power budget enforcement (§49) — watt cap
       d. Intent optimization (this section) — soft optimization
       e. EAS energy optimization (Section 14.5) — per-task core selection
     Each layer can only adjust WITHIN the ceiling set by the layer above.
     Power budget is a hard constraint; intents work within it. No oscillation.
     When power budgeting and EAS conflict (e.g., power budgeting throttles a
     CPU domain that EAS prefers), power budgeting takes precedence — the EAS
     migration is deferred until the power budget is satisfied. This may
     temporarily route tasks to less energy-efficient cores, but prevents
     thermal throttling and power supply overload, which are correctness
     constraints rather than optimization goals.

6. Log adjustments to /sys/kernel/isle/intent/adjustment_log
   for observability and debugging.

The in-kernel inference engine (Section 45) powers the optimization. The "Intent I/O Scheduler" and "Intent Page Prefetch" models (Section 45.5) are use cases of intent-based management.
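The stability controls in step 3 can be sketched as a single gatekeeper applied before any knob write. This is a minimal sketch assuming positive-valued knobs; `KnobState` and `apply_adjustment` are hypothetical names, and the exponential-backoff damping rule is omitted for brevity.

```rust
/// Per-knob state tracked by the optimizer (hypothetical sketch).
struct KnobState {
    current: f64,       // current knob value (assumed positive)
    last_change_s: u64, // timestamp of the last committed adjustment
}

/// Apply hysteresis, minimum hold time, and the per-cycle rate cap before
/// committing a change. Returns Some(new_value) if a change was made.
fn apply_adjustment(knob: &mut KnobState, desired: f64, now_s: u64) -> Option<f64> {
    // Minimum hold time: no changes within 5 seconds of the last adjustment.
    if now_s.saturating_sub(knob.last_change_s) < 5 {
        return None;
    }
    // Hysteresis: ignore deltas smaller than 10% of the current value.
    let delta = desired - knob.current;
    if delta.abs() < 0.10 * knob.current {
        return None;
    }
    // Max adjustment rate: at most ±20% change per optimization cycle.
    let max_step = 0.20 * knob.current;
    let step = delta.clamp(-max_step, max_step);
    knob.current += step;
    knob.last_change_s = now_s;
    Some(knob.current)
}
```

A large target gap is thus closed over several cycles rather than in one jump, which is exactly what prevents the optimizer from hunting around the setpoint.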

Intent Admission Control:

Intents are advisory, not guaranteed. When a cgroup sets intent.latency_ns = 5000000 (5ms), the kernel attempts to meet it but does not reject the intent if resources are insufficient. Instead:

  • If the intent cannot be met: intent.status reports latency_met: false with the actual observed P99 latency.
  • Clamping: intent values are clamped to physically achievable bounds. An intent.latency_ns = 1 (1 nanosecond) is silently clamped to the system's minimum achievable scheduling latency (~10μs on a typical x86 system).
  • Contradictions (e.g., intent.latency_ns = 1000 with intent.efficiency = 100) are logged as warnings in intent.status with contradiction: latency_vs_efficiency. The optimizer prioritizes the latency target.

Intent Feedback:

/sys/fs/cgroup/<group>/intent.status
    # Read-only. Current intent satisfaction:
    #   latency_met: true|false
    #   latency_p99_actual_ns: <value>
    #   throughput_met: true|false
    #   throughput_actual: <value>
    #   power_actual_mw: <value>
    #   optimizer_action: <last action taken, e.g., "raised cpu.weight to 200">
    #   adjustments_last_hour: <count>
    #   contradiction: <none|description>

Multi-Tenant Isolation:

The cgroup hierarchy IS the authority for resource isolation. Intents operate within existing cgroup limits:

  • A child cgroup's intent cannot cause resource consumption exceeding the parent's cpu.max, memory.max, or power.max limits.
  • Parent limits are a hard ceiling. Intents are soft optimization within that ceiling.
  • Cross-tenant interference is prevented by existing cgroup isolation — the intent optimizer adjusts knobs for one cgroup without affecting other cgroups' guarantees (cpu.min, memory.min are respected).

53.5 Performance Impact

The optimization loop runs once per second as a background kernel thread. Each iteration reads per-cgroup metrics, runs the inference engine, and writes adjusted parameters. The total cost scales linearly with the number of cgroups that have active intents. For typical server deployments (10-50 active intent cgroups), the overhead is negligible (sub-millisecond per iteration).

The actual scheduling/allocation decisions use the same fast paths as before. Only the cgroup parameters change. Hot-path performance: identical to Linux.

53.6 Explainability Interface

Intent optimization (§53.4) reports whether intents are met and what adjustments were made, but administrators also need to understand why a specific performance target is not being met, what the system tried and rejected, and what action they could take to help. The explainability interface provides this deep diagnostic view.

sysfs interface (per-cgroup, read-only):

/sys/fs/cgroup/<group>/intent.explain
    bottleneck: cpu|memory|io|accelerator|power|network
    bottleneck_detail: "CPU saturated: 4/4 cores at 100%, cpu.max reached"
    adjustments_attempted: 5
    adjustments_rejected: 2
    rejected_reasons: ["cpu.max ceiling reached", "power budget §49 constraint"]
    recommendation: "Increase cpu.max from 400000 to 600000"
    conflicting_intents: ["cgroup:/prod/db has higher cpu.weight, consuming 3/4 cores"]

Each field is populated by the optimization loop (§53.4) at the end of each cycle. The bottleneck field identifies the single most constrained resource. The recommendation field suggests the smallest configuration change that would allow the intent to be met. The conflicting_intents field lists other cgroups whose intents are competing for the same resource.

Structured tracepoint: isle_tp_stable_intent_explain is emitted every optimization cycle for cgroups with unmet intents. Fields: cgroup path, intent type, target value, actual value, bottleneck type, attempted adjustments (count), rejection reasons (array). This enables perf / BPF-based monitoring of intent optimization across the system.

Adjustment history log: Per-cgroup ring buffer of the last 64 adjustments, exposed via:

/sys/fs/cgroup/<group>/intent.adjustment_history
    # Each entry:
    #   timestamp: 1708012345.123456
    #   parameter: cpu.max
    #   old_value: 400000
    #   new_value: 500000
    #   reason: "latency_p99 target 5ms, actual 8ms, cpu was bottleneck"
    #   effect: "latency_p99 dropped from 8ms to 4.2ms at next cycle"

The ring buffer is fixed-size (64 entries, ~8 KiB per cgroup) and wraps around. It provides a complete causal trail: what changed, why, and what happened as a result.
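The fixed-size, wrapping history can be sketched as follows. `AdjustmentHistory` and `Adjustment` are hypothetical names, and the entry carries only a few of the fields shown in the sysfs listing above:

```rust
const HISTORY_LEN: usize = 64; // fixed capacity, oldest entry overwritten on wrap

#[derive(Clone, Debug, PartialEq)]
struct Adjustment {
    parameter: String, // e.g., "cpu.max"
    old_value: u64,
    new_value: u64,
}

struct AdjustmentHistory {
    entries: Vec<Option<Adjustment>>, // fixed capacity; slot = seq % HISTORY_LEN
    seq: u64,                         // total adjustments ever recorded
}

impl AdjustmentHistory {
    fn new() -> Self {
        AdjustmentHistory { entries: vec![None; HISTORY_LEN], seq: 0 }
    }

    fn record(&mut self, adj: Adjustment) {
        let slot = (self.seq % HISTORY_LEN as u64) as usize;
        self.entries[slot] = Some(adj); // wrap: overwrite the oldest entry
        self.seq += 1;
    }

    /// Number of entries currently readable (saturates at HISTORY_LEN).
    fn len(&self) -> usize {
        self.seq.min(HISTORY_LEN as u64) as usize
    }
}
```

Keeping the total sequence counter alongside the buffer lets readers report both "last 64 adjustments" and "adjustments ever made" from the same structure.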

Integration with islectl: islectl intent explain <cgroup> provides a human-readable summary combining intent.status + intent.explain + intent.adjustment_history into a single diagnostic view. Example output:

$ islectl intent explain /prod/web
Intent: latency_p99 ≤ 5ms
Status: NOT MET (actual: 8.1ms)
Bottleneck: CPU (4/4 cores at 100%, cpu.max = 400000)
Recommendation: Increase cpu.max to 600000
Conflicting: /prod/db (cpu.weight=200, consuming 3/4 cores)
Last 3 adjustments:
  [12:01:05] cpu.weight 100→150 — effect: p99 9.3ms→8.5ms
  [12:01:06] io.weight  100→200 — effect: p99 8.5ms→8.3ms (not bottleneck)
  [12:01:07] cpu.weight 150→200 — rejected: power budget §49 constraint

54. Unified Compute Model

54.1 The Convergence Problem

The architecture currently treats CPU scheduling (Section 14) and accelerator scheduling (Section 42.2.4) as separate worlds:

World 1 — CPU Scheduler (Section 14):
  Input:     threads (instruction streams)
  Resources: CPU cores (P-core, E-core, RISC-V harts)
  Decision:  which core runs this thread?
  Cgroup:    cpu.max, cpu.weight

World 2 — Accelerator Scheduler (Section 42.2.4):
  Input:     command buffers (GPU kernels, inference requests)
  Resources: accelerator contexts (GPU CUs, NPU engines)
  Decision:  which context gets device time?
  Cgroup:    accel.compute.max, accel.compute.weight

These worlds share no abstraction. The kernel cannot answer: "given a fixed power budget and a matrix workload, is it more efficient to run on CPU-AMX, GPU, or NPU?" It cannot balance compute load across device types. It cannot make holistic energy decisions.

Meanwhile, the hardware is converging:

Trend                          Example                                     Implication
─────────────────────────────  ──────────────────────────────────────────  ────────────────────────────────────
CPU gains matrix ops           Intel AMX (P-cores), ARM SME                CPU can do what GPUs used to do
CPU+GPU share memory           APU (AMD), Apple M-series,                  No DMA copy between CPU↔GPU
                               Grace Hopper NVLink-C2C
Heterogeneous ISA within CPUs  RISC-V: some harts have Vector,             "CPU with Vector" vs "GPU CU" is
                               some don't (Section 14.5.9)                 a matter of degree
CXL 3.0 shared memory          Samsung CMM-H, Intel Ponte Vecchio + CXL    Hardware-coherent memory shared by
                                                                           CPU and accelerator
On-die NPU                     Intel Meteor Lake NPU, Qualcomm Hexagon     NPU is as close to CPU as an E-core

The conceptual leap from Section 14.5.9 (RISC-V harts with different ISA extensions) to "GPU CU as another compute unit type" is small. Both are compute resources with different capability profiles and power characteristics.

54.2 Design Principle: Overlay, Not Replacement

Critical constraint: This must work Day 1 as a Linux drop-in replacement. NVIDIA's proprietary userspace (libcuda, libnvidia-ml, cuDNN) runs unmodified. CUDA applications explicitly target the GPU — the kernel does NOT redirect them. AMD ROCm, Intel oneAPI, all work as-is.

The unified compute model is an advisory overlay on top of the existing separate schedulers:

                    ┌──────────────────────────────┐
                    │  Unified Compute Topology    │  ← NEW (advisory)
                    │  Multi-dimensional capacity  │
                    │  Cross-device energy model   │
                    └──────┬───────────────┬───────┘
                           │               │
                    ┌──────▼──────┐ ┌──────▼──────────┐
                    │CPU Scheduler│ │Accel Scheduler  │  ← UNCHANGED
                    │ CFS + EAS   │ │ CBS + Prio      │
                    │ (Section 14)│ │ (Section 42.2.4)│
                    └─────────────┘ └─────────────────┘
  • Existing schedulers continue to make execution decisions independently.
  • The unified layer provides topology, capacity, and energy data that both schedulers and userspace runtimes can consume.
  • No vendor must rewrite anything. Benefits accrue from kernel-side visibility.

54.3 Multi-Dimensional Compute Capacity

Section 14.5.1 defines CpuCapacity as a single scalar (0–1024). This works for heterogeneous CPUs because all cores execute the same type of work (general-purpose instructions) at different speeds.

Once accelerators enter the picture, capacity becomes a vector — different compute units excel at different workload types:

// isle-core/src/compute/capacity.rs

/// Multi-dimensional capacity profile for any compute unit.
///
/// Values are ABSOLUTE (device-intrinsic), not normalized to the system.
/// Each dimension is in hardware-specific units that do not change when
/// devices are hot-plugged. The kernel maintains a per-system max for
/// each dimension (updated lazily on device arrival/departure) and
/// normalizes to 0–1024 ONLY for sysfs display (compute.capacity_normalized).
///
/// Why absolute: normalizing internally means hot-plugging a faster GPU
/// would silently change every other device's capacity values, breaking
/// comparisons across snapshots and racing with concurrent readers.
///
/// Used for:
///   - Power budgeting: informed cross-device throttling (§49)
///   - Intent-based management: workload-to-device advisory (§53)
///   - Userspace runtime hints: exposed via sysfs for OpenCL/SYCL
///
/// NOT used for: actual scheduling decisions (those remain in
/// CPU scheduler and AccelScheduler respectively).
pub struct ComputeCapacityProfile {
    /// Scalar integer throughput (million instructions per second, MIPS-equivalent).
    /// CPU P-core: ~50000. GPU CU: ~200. NPU: 0.
    pub scalar: u32,

    /// Vector/SIMD throughput (GFLOPS single-precision equivalent).
    /// GPU CU: ~2000. CPU P-core with AVX-512: ~300. NPU: 0.
    pub vector: u32,

    /// Matrix throughput (TOPS, tera-operations per second for matmul).
    /// NPU: ~40. GPU tensor core: ~300. CPU AMX: ~5. CPU scalar: ~0.
    pub matrix: u32,

    /// Memory bandwidth (GB/s).
    /// GPU HBM3: ~3000. CPU DDR5: ~100. NPU on-chip SRAM: ~500.
    pub memory_bw: u32,

    /// Launch overhead (inverse, microseconds to first useful work).
    /// CPU: 1 (thread wakeup ~1μs). GPU: 30 (kernel launch ~30μs).
    /// NPU: 200 (model load + inference setup).
    /// Lower = better. Determines crossover: small tasks favor CPU.
    pub launch_overhead_us: u32,
}

Example profiles on a Grace Hopper system (ARM CPU + H100 GPU):

CPU Grace core (ARM Neoverse):
  scalar=45000  vector=300  matrix=2    memory_bw=100  launch_overhead_us=1

GPU H100 SM:
  scalar=200    vector=2000 matrix=300  memory_bw=3000 launch_overhead_us=30

Intel Alder Lake system with discrete GPU and NPU:
CPU P-core:   scalar=50000  vector=300  matrix=5   memory_bw=80  launch_overhead_us=1
CPU E-core:   scalar=25000  vector=150  matrix=2   memory_bw=80  launch_overhead_us=1
iGPU EU:      scalar=100    vector=600  matrix=10  memory_bw=50  launch_overhead_us=20
Discrete GPU: scalar=200    vector=2000 matrix=250 memory_bw=900 launch_overhead_us=30
NPU:          scalar=0      vector=0    matrix=40  memory_bw=50  launch_overhead_us=200

RISC-V SoC with heterogeneous harts + accelerators:
Hart (RV64GC):   scalar=15000  vector=0    matrix=0   memory_bw=30  launch_overhead_us=1
Hart (RV64GCV):  scalar=15000  vector=200  matrix=1   memory_bw=30  launch_overhead_us=1
Custom ML hart:  scalar=3000   vector=100  matrix=20  memory_bw=60  launch_overhead_us=5
Attached NPU:    scalar=0      vector=0    matrix=10  memory_bw=40  launch_overhead_us=200

(These values are illustrative — actual values are populated from driver-reported
specs and hardware capability queries at device registration time. Sysfs normalizes
to 0-1024 per dimension for display, where best-in-system = 1024.)
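The display-only normalization can be sketched as follows (function names are illustrative; the lazy per-system max tracking is an assumption about how the update might look):

```rust
/// Recompute the per-system max for one capacity dimension.
/// Called lazily on device arrival/departure (cold path).
fn recompute_system_max(dimension_values: &[u32]) -> u32 {
    dimension_values.iter().copied().max().unwrap_or(0)
}

/// Normalize one absolute capacity value to 0-1024 for sysfs display,
/// where the best unit in the system reads 1024. Internal comparisons
/// keep using the absolute values.
fn normalize_for_display(value: u32, system_max: u32) -> u32 {
    if system_max == 0 {
        return 0; // no unit in the system has this capability
    }
    ((value as u64 * 1024) / system_max as u64) as u32
}
```

On the Alder Lake example above, the discrete GPU's vector=2000 displays as 1024 and the P-core's vector=300 as 153; hot-plugging a faster GPU changes only the displayed values, never the absolute profiles.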

Key property: CPU cores already have entries (derived from CpuCapacity in Section 14.5.1). Accelerators get profiles from AccelBase get_utilization (Section 42.2.2) extended with a get_capacity_profile vtable entry. This is a minor KABI extension — one new function pointer that returns static data.

54.4 Unified Compute Topology

The device registry (Section 7) already models all devices in one tree. The unified compute layer adds a compute view that flattens this into a map of compute units with their capabilities, power profiles, and memory domains:

// isle-core/src/compute/topology.rs

/// A compute unit in the unified topology.
/// Can be a CPU core, GPU SM, NPU engine, DSP core, etc.
pub struct ComputeUnit {
    /// Device registry node ID.
    pub device_id: DeviceNodeId,

    /// What kind of compute unit this is.
    pub unit_type: ComputeUnitType,

    /// Multi-dimensional capacity profile.
    pub capacity: ComputeCapacityProfile,

    /// Which memory domain is local to this compute unit?
    /// CPU cores → system RAM NUMA node.
    /// GPU SMs → VRAM NUMA node (Section 43).
    /// APU GPU → same NUMA node as CPU (shared memory).
    pub memory_domain: MemoryDomainId,

    /// Is memory shared with CPU without DMA copy?
    /// true for: APU, Apple M-series, Grace Hopper, CXL-attached accelerator.
    /// false for: discrete PCIe GPU (data must be explicitly transferred).
    pub memory_unified_with_cpu: bool,

    /// Energy model: OPP table (same format as Section 14.5.2).
    /// GPU OPPs come from the driver via AccelBase get_utilization/set_performance_level.
    pub energy_model: Option<EnergyModelRef>,

    /// Current utilization (0–1024), updated periodically.
    /// CPU: from PELT (Section 14.5.4).
    /// Accelerator: from AccelBase get_utilization (Section 42.2.2).
    pub utilization: AtomicU32,
}

#[repr(u32)]
pub enum ComputeUnitType {
    /// General-purpose CPU core. Managed by CPU scheduler (Section 14).
    CpuCore         = 0,
    /// GPU compute unit (SM/CU/EU). Managed by AccelScheduler (Section 42.2.4).
    GpuCompute      = 1,
    /// Neural processing unit. Managed by AccelScheduler.
    NpuEngine       = 2,
    /// Digital signal processor. Managed by AccelScheduler.
    DspCore         = 3,
    /// FPGA reconfigurable logic. Managed by AccelScheduler.
    FpgaSlot        = 4,
    /// Computational storage processor (§33). Managed by AccelScheduler.
    CsdProcessor    = 5,
}

Population: The topology is built at boot and updated on hot-plug:

1. CPU cores:    discovered by existing CPU topology (Section 14.5.10).
                 ComputeCapacityProfile derived from CpuCapacity + IsaCapabilities.

2. Accelerators: discovered by device registry (Section 7) when AccelBase driver loads.
                 Driver provides ComputeCapacityProfile via get_capacity_profile().
                 If driver doesn't implement it (legacy, compat):
                   → kernel estimates from AccelDeviceClass + get_utilization().
                   → NVIDIA compat driver: AccelDeviceClass::GpuCompute,
                     bandwidth/utilization from nvidia-smi equivalent queries.
                   → No NVIDIA code change needed. Kernel reads existing telemetry.
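A hedged sketch of the estimation fallback for drivers that don't implement get_capacity_profile(). The per-class placeholder numbers are illustrative; a real implementation would refine them from the PCI-ID spec database described in §54.9:

```rust
/// Accelerator classes (subset, illustrative).
enum AccelDeviceClass {
    GpuCompute,
    Npu,
    Dsp,
}

/// Mirrors ComputeCapacityProfile from §54.3.
struct ComputeCapacityProfile {
    scalar: u32,
    vector: u32,
    matrix: u32,
    memory_bw: u32,
    launch_overhead_us: u32,
}

/// Conservative kernel-side estimate when the driver is legacy/compat
/// and provides no profile of its own.
fn estimate_profile(class: &AccelDeviceClass) -> ComputeCapacityProfile {
    match class {
        AccelDeviceClass::GpuCompute => ComputeCapacityProfile {
            scalar: 200, vector: 2000, matrix: 250, memory_bw: 900, launch_overhead_us: 30,
        },
        AccelDeviceClass::Npu => ComputeCapacityProfile {
            scalar: 0, vector: 0, matrix: 40, memory_bw: 50, launch_overhead_us: 200,
        },
        AccelDeviceClass::Dsp => ComputeCapacityProfile {
            scalar: 1000, vector: 500, matrix: 0, memory_bw: 30, launch_overhead_us: 10,
        },
    }
}
```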

54.5 Cross-Device Energy Optimization

The power budgeting system (§49) currently reads power per domain (CPU package, DRAM, Accelerator) and throttles independently. With unified topology, it can make informed cross-device decisions:

Current (independent throttling):
  Container exceeds power.max.
  → Throttle CPU (reduce frequency).
  → Throttle GPU (reduce clock).
  → Both throttled equally. Dumb.

With unified compute awareness:
  Container exceeds power.max.
  → Kernel reads workload profile:
      80% of compute is matrix ops (GPU-bound).
      20% is scalar (CPU, mostly waiting for GPU).
  → Informed decision:
      Keep GPU at high clock (it's doing the useful work).
      Aggressively throttle CPU (it's mostly idle-waiting anyway).
      Save more power with less performance loss.

This requires no change to throttle mechanisms — just better information for PowerBudgetEnforcer (§49.4) to decide WHICH domain to throttle.

// Extension to isle-core/src/power/budget.rs

impl PowerBudgetEnforcer {
    /// When a cgroup exceeds its power budget, decide which domain to throttle.
    /// Uses unified compute topology to understand where useful work is happening.
    fn select_throttle_target(
        &self,
        cgroup: CgroupId,
        excess_mw: u32,
    ) -> ArrayVec<ThrottleAction, MAX_POWER_DOMAINS> {
        let topology = unified_compute_topology();
        let workload = cgroup_workload_profile(cgroup);

        // Score each domain by "usefulness" = domain's contribution to
        // the cgroup's primary workload type.
        // Throttle the LEAST useful domain first.
        // domains is bounded by MAX_POWER_DOMAINS (no heap allocation).
        let mut domains: ArrayVec<_, MAX_POWER_DOMAINS> = self.domains.iter()
            .filter(|d| d.cgroup_attribution(cgroup) > 0)
            .map(|d| (d, workload.usefulness_score(d, &topology)))
            .collect();

        // Sort: least useful first (throttle first).
        domains.sort_by_key(|(_, score)| *score);

        // Apply throttle actions starting from least useful domain
        // until excess_mw is recovered.
        self.build_throttle_plan(&domains, excess_mw)
    }
}
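One plausible shape for the usefulness_score referenced above (the scoring weights and the DomainKind/Profile stand-in types are assumptions — the real kernel would consult the full WorkloadProfile and unified topology):

```rust
/// Power domain kinds (subset, stand-in for the real domain descriptors).
enum DomainKind {
    CpuPackage,
    Gpu,
    Dram,
}

/// Workload fractions in 0-1000 fixed point, as in WorkloadProfile (§54.6).
struct Profile {
    scalar: u32,
    vector: u32,
    matrix: u32,
    accel_wait: u32,
}

/// Higher score = more useful work in that domain = throttled later.
fn usefulness_score(domain: &DomainKind, p: &Profile) -> u32 {
    match domain {
        // CPU usefulness: its scalar/vector share, discounted by the time
        // it spends idle-waiting on the accelerator.
        DomainKind::CpuPackage => (p.scalar + p.vector).saturating_sub(p.accel_wait),
        // GPU usefulness: the vector + matrix demand it serves.
        DomainKind::Gpu => p.vector + p.matrix,
        // DRAM: fixed low baseline in this sketch.
        DomainKind::Dram => 100,
    }
}
```

For the §54.5 example (80% matrix, 20% scalar, CPU mostly accel-waiting), the CPU package scores 0 and the GPU 800, so the CPU domain is throttled first.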

54.6 Workload Profile Classification

Intel Thread Director (Section 14.5.6) classifies CPU workloads by instruction mix. Generalize this to a system-wide workload profile that covers all compute domains:

// isle-core/src/compute/classify.rs

/// System-wide workload classification for a cgroup or process.
/// Updated periodically (~1 second) from multiple sources.
///
/// Fractions are fixed-point: 0–1000 representing 0.0%–100.0%.
/// No floating-point in kernel data structures (kernel does not use FPU).
///
/// Invariant: `scalar_fraction + vector_fraction + matrix_fraction <= 1000`.
/// These three fields partition the compute demand; their sum must not exceed 1000.
/// `accel_wait_fraction` and `memory_bound_fraction` are independent blocking-time
/// metrics and are not included in the partition sum.
/// Values that would push the sum above 1000 are saturated at 1000 by the
/// classifier before storing; the dominant fraction absorbs any excess.
pub struct WorkloadProfile {
    /// Fraction of compute demand that is scalar (0–1000).
    /// Source: PELT utilization on CPU cores + ITD hints.
    pub scalar_fraction: u32,

    /// Fraction of compute demand that is vector/SIMD (0–1000).
    /// Source: hardware performance counters (SIMD instruction retired).
    pub vector_fraction: u32,

    /// Fraction of compute demand that is matrix/tensor (0–1000).
    /// Source: AMX/SME counters on CPU, utilization reports from AccelBase.
    pub matrix_fraction: u32,

    /// Fraction of time spent waiting for accelerator completion (0–1000).
    /// Source: CPU scheduler (time in interruptible sleep waiting for accel).
    /// High value = GPU-bound workload.
    pub accel_wait_fraction: u32,

    /// Fraction of time spent waiting for memory/I/O (0–1000).
    /// Source: hardware counters (LLC miss rate, stall cycles).
    pub memory_bound_fraction: u32,

    /// Dominant compute domain for this workload.
    /// Derived from the fractions above.
    pub dominant_domain: ComputeUnitType,
}
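How dominant_domain might fall out of the fractions (the decision rule here is an assumption; real attribution would also use the per-device utilization the kernel already tracks):

```rust
#[derive(Debug, PartialEq)]
enum ComputeUnitType {
    CpuCore,
    GpuCompute,
}

/// Derive the dominant domain from the partitioned fractions (0-1000).
/// Matrix-dominant work with high accel_wait is attributed to the
/// accelerator; everything else to the CPU (including matrix via AMX).
fn dominant_domain(scalar: u32, vector: u32, matrix: u32, accel_wait: u32) -> ComputeUnitType {
    let max = scalar.max(vector).max(matrix);
    if max == matrix && accel_wait >= 500 {
        ComputeUnitType::GpuCompute
    } else {
        ComputeUnitType::CpuCore
    }
}
```

The compute.profile sysfs example in this section (scalar=150, vector=50, matrix=700, accel_wait=600) would classify as gpu_compute under this rule.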

Where this data is used:

  1. Power budgeting (§49): Which domain to throttle (§54.5 above).
  2. Intent-based management (§53): When intent.efficiency = 80 (prefer efficiency), and the workload is matrix-dominant, the optimizer suggests moving from GPU to NPU (lower power per matrix op).
  3. Userspace runtimes: Exposed via sysfs for consumption by OpenCL/SYCL/oneAPI runtimes that make device selection decisions.

/sys/fs/cgroup/<group>/compute.profile
    # Read-only. Current workload classification (0-1000 = 0.0%-100.0%):
    #   scalar: 150
    #   vector: 50
    #   matrix: 700
    #   accel_wait: 600
    #   memory_bound: 100
    #   dominant: gpu_compute

/sys/kernel/isle/compute/topology
    # Read-only. JSON: all compute units with capacity profiles.
    # Consumed by userspace runtimes for device selection.

/sys/kernel/isle/compute/unit/<device_id>/capacity
    # Read-only. Per-unit: "scalar=1024 vector=300 matrix=150 ..."

54.7 Unified Cgroup Compute Budget (Optional)

An optional cgroup knob that expresses total compute need abstractly, leaving device selection to the kernel:

/sys/fs/cgroup/<group>/compute.weight
    # Proportional share of total system compute (across ALL devices).
    # Default: 0 (disabled — use existing cpu.weight + accel.compute.weight).
    # When set: kernel adjusts cpu.weight and accel.compute.weight internally
    # to optimize for the cgroup's workload profile.
    #
    # Example: two cgroups, both compute.weight=100.
    # Cgroup A is GPU-bound → kernel gives A more GPU time, less CPU.
    # Cgroup B is CPU-bound → kernel gives B more CPU time, less GPU.
    # Both get "equal compute" in terms of actual useful work done.

Implementation: compute.weight is an orchestration knob. The kernel's intent optimizer (§53) reads compute.weight + compute.profile and adjusts the existing per-domain knobs (cpu.weight, accel.compute.weight) every ~1 second. No new scheduling fast path. No change to CPU scheduler or AccelScheduler.

When compute.weight is 0 (default): existing separate knobs work exactly as they do on Linux. Zero overhead. Full backward compatibility.
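A minimal sketch of the per-cycle translation (the split heuristic and names are assumptions — the real adjustment flows through the §53 intent optimizer):

```rust
/// The per-domain knobs the kernel already has.
#[derive(Debug, PartialEq)]
struct DomainWeights {
    cpu_weight: u32,
    accel_weight: u32,
}

/// Translate an abstract compute.weight into per-domain weights, biased
/// toward the domain doing the cgroup's useful work. `accel_fraction` is
/// the matrix + accel_wait demand in 0-1000 fixed point (from compute.profile).
fn split_compute_weight(compute_weight: u32, accel_fraction: u32) -> DomainWeights {
    let accel = (compute_weight as u64 * accel_fraction.min(1000) as u64 / 1000) as u32;
    DomainWeights {
        cpu_weight: compute_weight.saturating_sub(accel).max(1), // never zero a domain
        accel_weight: accel.max(1),
    }
}
```

With both cgroups at compute.weight=100, a GPU-bound profile (accel_fraction=800) yields cpu_weight=20 / accel_weight=80, while a CPU-bound one (accel_fraction=100) yields 90 / 10.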

54.8 Unified Memory Domain Tracking

When CPU and accelerator share physical memory (no DMA copy boundary), the memory manager should understand this for page placement:

// isle-core/src/compute/memory.rs

/// Memory domain descriptor in the unified compute topology.
pub struct MemoryDomain {
    /// NUMA node ID (integrates with existing memory manager, Section 12).
    pub numa_node: u8,

    /// Which compute units have local access to this memory?
    /// On an APU: both CPU cores and GPU CUs list the same domain.
    /// On discrete GPU: GPU CUs list VRAM domain, CPU lists DDR domain.
    /// Populated at boot/hot-plug (cold path, heap available). The count is
    /// bounded by the number of compute units in the system.
    pub local_compute_units: Vec<DeviceNodeId>,  // heap: cold-path only, after allocator init

    /// Is this domain coherent across all local compute units?
    /// true: APU shared memory, CXL 3.0 coherent pool.
    /// false: discrete GPU VRAM (requires explicit flush/invalidate).
    pub hardware_coherent: bool,

    /// Bandwidth and latency from each compute unit type.
    /// Used by page placement decisions.
    /// Populated at boot/hot-plug (cold path, heap available).
    pub access_costs: Vec<MemoryAccessCost>,  // heap: cold-path only, after allocator init
}

pub struct MemoryAccessCost {
    pub from_unit: DeviceNodeId,
    pub latency_ns: u32,
    pub bandwidth_gbs: u32,
}

What this enables:

Discrete GPU (PCIe, separate memory):
  CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
  GPU SMs   → VRAM NUMA node 2 (latency: 100ns, BW: 900 GB/s)
  CPU→VRAM: latency 500ns, BW 25 GB/s (PCIe)
  → Page migration between CPU and GPU is expensive.
  → Applications MUST explicitly manage data placement (cudaMemcpy).
  → Kernel's role: NUMA-aware allocation. Same as Linux.

APU (shared memory):
  CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
  GPU CUs   → DDR NUMA node 0 (latency: 90ns, BW: 45 GB/s)  ← SAME NODE
  → No page migration needed. CPU and GPU see the same pages.
  → Kernel can optimize page placement within the shared domain
    (e.g., cache-line alignment for GPU access patterns).
  → Workload migration CPU↔GPU is a scheduling decision only, no data movement.

Grace Hopper (NVLink-C2C unified memory):
  CPU cores → LPDDR5X NUMA node 0 (latency: 80ns, BW: 500 GB/s)
  GPU SMs   → HBM3 NUMA node 1 (latency: 100ns, BW: 3000 GB/s)
  CPU↔GPU:  NVLink-C2C (latency: 200ns, BW: 900 GB/s, COHERENT)
  → Hardware-coherent. Kernel can migrate pages transparently.
  → Hot pages accessed by GPU → migrate to HBM3 (faster).
  → Cold pages → migrate to LPDDR5X (more capacity).
  → Same mechanism as NUMA balancing between CPU sockets, extended to GPU.

This is an extension of the existing NUMA-aware page placement (Section 12, Section 43), not a new mechanism. The PageLocationTracker (Section 43) already tracks which NUMA node pages belong to. Unified memory domains just ensure accelerator-local memory is correctly represented as a NUMA node with proper distance/bandwidth metadata.
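As an illustration of what the access-cost metadata enables, a sketch of a hot-page migration check (the hotness threshold and function name are illustrative; the mechanism itself remains the existing NUMA balancer):

```rust
/// Per-unit access cost, as in MemoryAccessCost above.
struct AccessCost {
    latency_ns: u32,
    bandwidth_gbs: u32,
}

/// Would moving a hot page to `candidate` benefit the compute unit that
/// dominates its accesses? Only considered on hardware-coherent domains;
/// across a non-coherent boundary (discrete PCIe GPU) placement stays
/// explicit and application-managed.
fn should_migrate(
    hardware_coherent: bool,
    current: &AccessCost,   // dominant unit's cost to reach the page today
    candidate: &AccessCost, // its cost if the page moved
    accesses_per_sec: u64,  // tracked hotness (PageLocationTracker-style)
) -> bool {
    if !hardware_coherent {
        return false;
    }
    accesses_per_sec > 1_000
        && candidate.latency_ns < current.latency_ns
        && candidate.bandwidth_gbs > current.bandwidth_gbs
}
```

On the Grace Hopper numbers above, a GPU-hot page in LPDDR5X (reached at 200ns / 900 GB/s over NVLink-C2C) migrates to HBM3 (100ns / 3000 GB/s); the same page behind a discrete PCIe GPU never auto-migrates.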

54.9 NVIDIA Compatibility: No Changes Required

The unified compute model is specifically designed to NOT require driver changes:

NVIDIA driver stack (discrete GPU, PCIe):

  Userspace (closed-source, binary compat):
    libcuda.so         — CUDA runtime        → unchanged
    libnvidia-ml.so    — management library   → unchanged
    libnvcuvid.so      — video decode         → unchanged
    All communicate via ioctl to kernel driver → unchanged

  Kernel driver (open-source nvidia.ko, ported per Section 42 KABI):
    Implements AccelBase vtable:
      get_utilization() → kernel reads GPU utilization, power, clock
      submit_commands() → kernel sees command flow
      set_performance_level() → kernel can request clock changes

  What the unified compute layer reads (no new driver code):
    1. GPU utilization % → from get_utilization() (already required by AccelBase)
    2. GPU power draw mW → from get_utilization() (already required)
    3. GPU clock MHz     → from get_utilization() (already required)
    4. Memory bandwidth  → from get_utilization() or static spec data

  What the unified compute layer estimates (kernel-side, no driver involvement):
    5. ComputeCapacityProfile → derived from AccelDeviceClass::GpuCompute +
       known GPU specs (SM count, tensor core presence, memory type).
       Spec database in kernel, keyed by PCI device ID. Same approach as
       Linux's GPU frequency tables.

  Optional future enhancement (minor KABI extension):
    6. get_capacity_profile() → driver provides precise profile.
       Not required. Kernel estimate works without it.

CUDA applications continue to explicitly target the GPU. The kernel does NOT intercept CUDA calls or redirect compute. The benefits are:
  - Better power budgeting (kernel knows GPU is the useful domain)
  - Better cgroup fairness (compute.weight distributes across CPU+GPU)
  - Better topology data for orchestrators (Kubernetes reads sysfs)

54.10 What the Kernel Does NOT Do

To be explicit about boundaries — the kernel does NOT:

  1. Automatically redirect CUDA/ROCm/oneAPI workloads between devices. Applications that explicitly target a device continue to target that device. The kernel respects explicit choices.

  2. Implement a compute compiler that translates CPU code to GPU kernels or vice versa. That's a userspace runtime concern (OpenCL, SYCL, Vulkan Compute).

  3. Require drivers to expose internal scheduling decisions. GPU drivers still schedule internally. The kernel provides cross-device orchestration data.

  4. Add overhead to the compute submission hot path. Command submission (submit_commands) goes through AccelScheduler exactly as before. The unified topology is a background advisory system consulted at ~1 second intervals.

  5. Break on systems with no accelerators. When only CPUs are present, the unified compute topology contains only CPU entries. It degrades to exactly the Section 14.5 CpuCapacity model. Zero overhead.

54.11 Sysfs Interface for Userspace Runtimes

The key practical benefit: userspace runtimes (OpenCL, SYCL, oneAPI, future CUDA alternatives) can query the kernel for topology + workload data instead of each runtime re-discovering hardware independently:

/sys/kernel/isle/compute/
    topology.json                    # Full compute topology (all units)
    unit_count                       # Number of compute units

/sys/kernel/isle/compute/unit/<id>/
    type                             # "cpu_core", "gpu_compute", "npu_engine", ...
    capacity                         # Absolute: "scalar=50000 vector=400 matrix=5 ..."
    capacity_normalized              # Normalized 0-1024: "scalar=1024 vector=390 ..."
    memory_domain                    # NUMA node ID
    memory_unified                   # "1" if shared with CPU, "0" if separate
    utilization                      # Current utilization (0-1024)
    energy_model                     # OPP table (freq, capacity, power)

/sys/fs/cgroup/<group>/
    compute.profile                  # Workload classification (read-only)
    compute.weight                   # Unified compute budget (optional, default 0)

Use case: A SYCL runtime deciding between CPU and GPU for a kernel launch:

1. Read /sys/kernel/isle/compute/unit/*/capacity
2. Read /sys/kernel/isle/compute/unit/*/memory_unified
3. Know: GPU has matrix=800, memory_unified=1 (APU).
4. Decision: matrix workload + shared memory → GPU (no copy cost).
   vs. on discrete GPU: memory_unified=0 → compare launch overhead
   + data transfer cost vs GPU throughput gain.

Today on Linux, each runtime does its own hardware discovery via vendor-specific APIs (nvml, rocm-smi, level-zero). The kernel provides no unified view. ISLE's sysfs topology eliminates redundant discovery and gives runtimes data the kernel already has (NUMA distances, power state, utilization).
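A userspace-side sketch of consuming the per-unit capacity format shown above (the key=value line format is as documented; the parsing helper itself is illustrative):

```rust
use std::collections::HashMap;

/// Parse one capacity line, e.g. the contents of
/// /sys/kernel/isle/compute/unit/<id>/capacity:
///   "scalar=50000 vector=300 matrix=5 memory_bw=80 launch_overhead_us=1"
fn parse_capacity(line: &str) -> HashMap<String, u32> {
    line.split_whitespace()
        .filter_map(|kv| {
            let (key, value) = kv.split_once('=')?;
            Some((key.to_string(), value.parse().ok()?))
        })
        .collect()
}
```

A runtime would parse this for every unit, then compare the `matrix` dimension (plus `memory_unified`) to pick a launch target as in the SYCL example above.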

54.12 Linux Compatibility

No existing Linux interfaces are affected. All new interfaces are additive:

Existing (preserved):
  /sys/devices/system/cpu/cpuN/*           — CPU topology, unchanged
  /sys/class/drm/card0/*                   — GPU sysfs, unchanged
  /dev/nvidia*, /dev/dri/*                 — device nodes, unchanged
  sched_setattr(), ioctl(GPU_SUBMIT, ...)  — syscalls, unchanged

New (additive):
  /sys/kernel/isle/compute/*               — unified compute topology
  /sys/fs/cgroup/<group>/compute.profile   — workload classification
  /sys/fs/cgroup/<group>/compute.weight    — optional unified budget

Applications unaware of new interfaces see standard Linux behavior.

54.13 Convergence Path: Accelerators as Peer Kernel Nodes

The unified compute topology (§54.4) treats accelerators as opaque compute units behind AccelBase vtables. This works Day 1 with existing proprietary firmware. But the architecture already contains the design for the next step.

Observation: every modern accelerator already has its own processor and runs its own kernel or firmware:

Device                 Processor            Runs today           Transport
─────────────────────  ───────────────────  ───────────────────  ─────────
NVIDIA GPU (Ada+)      RISC-V (GSP cores)   Proprietary μkernel  PCIe/NVLink
NVIDIA BlueField DPU   ARM A78              Full Linux kernel    PCIe
Intel Gaudi NPU        Custom cores         Firmware             PCIe
AMD Instinct           Embedded μctrl       Firmware             PCIe/xGMI
CXL memory expander    Management proc      Firmware             CXL
Crypto coprocessor     Dedicated core       Firmware/RTOS        PCIe/SPI
Future RISC-V accel    RV64 harts           Firmware             PCIe/CXL/custom

The distributed kernel (Section 47) already solves "multiple ISLE instances sharing memory and capabilities across a transport." SmartNIC/DPU offload (§48) already says "a DPU is a close remote node connected via PCIe."

The convergence: any device with its own processor is a potential peer kernel node. If a vendor replaces proprietary firmware with an ISLE-lite instance, that device becomes a full participant in the distributed kernel fabric — its memory becomes DSM-managed, its compute is visible to the cluster scheduler, capabilities flow across the interconnect.

54.13.1 The Three-Stage Adoption Path

Naming note: The "stages" below describe the accelerator integration maturity model within Section 54. They are NOT the project-wide implementation phases (Phase 1-5) defined in Section 56 (14-roadmap.md). The mapping is: Stage A (Opaque) ships with Phase 3-4 (Real Workloads / Production Ready). Stage B (Advisory) ships with Phase 4-5 (Production Ready / Ecosystem). Stage C (Peer) targets Phase 5+ (Ecosystem and beyond, vendor-driven).

Stage A — Opaque (Day 1, drop-in Linux replacement):
  ┌──────────────┐  AccelBase vtable    ┌──────────────┐
  │  ISLE        │ ───── (ioctl) ─────► │  Proprietary │
  │  (host CPU)  │                      │  firmware    │
  └──────────────┘                      └──────────────┘
  Kernel submits commands, reads telemetry. Device is a black box.
  Works with existing NVIDIA, AMD, Intel stacks. No vendor changes.

Stage B — Advisory topology (this section):
  Same as Stage A, plus:
  - Kernel builds multi-dimensional capacity profiles.
  - Workload classification drives power budgeting and intent optimization.
  - Sysfs exposes topology data for userspace runtimes.
  Still opaque device firmware. Benefits from kernel-side intelligence.

Stage C — Peer kernel node (vendor adoption):
  ┌──────────────┐  Section 47 distributed  ┌──────────────┐
  │  ISLE        │ ───── kernel ──────────► │  ISLE        │
  │  (host CPU)  │  (PCIe/NVLink/CXL)       │  (device)    │
  └──────────────┘                          └──────────────┘
  Device runs ISLE-lite. Becomes a node in the distributed fabric:
  - Device memory → DSM-managed (Section 47.5). Transparent page sharing.
  - Device compute → visible to cluster scheduler (Section 47.8).
  - Capabilities → network-portable across host↔device (Section 47.9).
  - Crash recovery → kernel restart on device, state preserved (Section 9).

  The application API is UNCHANGED between Stage A and Stage C.
  A CUDA app, an OpenCL app, a custom accelerator app — all work at every stage.
  Stage C just makes the device more deeply integrated and manageable.

54.13.2 Transport Unification

The architecture currently has two separate transport abstractions:

KernelTransport (Section 47.3):  RDMA-only, for inter-node distributed kernel.
OffloadTransport (§48):          PCIe/SharedMemory/RDMA, for DPU offload.

These should converge into a single NodeTransport that covers all interconnects:

// isle-core/src/transport/mod.rs

/// Unified transport between kernel nodes.
/// Covers all interconnect types: network (RDMA), local bus (PCIe),
/// chip-to-chip (NVLink, xGMI, CXL), and future interconnects.
pub enum NodeTransport {
    /// RDMA (InfiniBand, RoCE). Inter-node across network.
    /// Existing KernelTransport functionality.
    Rdma {
        device: DeviceNodeId,
        connection: RdmaConnection,
    },

    /// PCIe BAR-mapped shared memory. Host↔device on same machine.
    /// For DPUs, discrete GPUs, add-in accelerators.
    PcieBar {
        device: DeviceNodeId,
        bar_base: PhysAddr,
        bar_size: u64,
        mailbox: Option<PcieMailbox>,
    },

    /// NVLink / NVLink-C2C. Chip-to-chip, hardware-coherent.
    /// For GPU↔GPU or CPU↔GPU (Grace Hopper).
    NvLink {
        device: DeviceNodeId,
        link_id: u32,
        coherent: bool,
        bandwidth_gbs: u32,
    },

    /// CXL 3.0. Hardware-coherent shared memory.
    /// For CXL-attached accelerators, memory expanders, composable infrastructure.
    Cxl {
        device: DeviceNodeId,
        cxl_port: u32,
        coherent: bool,
        bandwidth_gbs: u32,
    },

    /// TCP/IP fallback. For non-RDMA networks.
    /// Existing fallback in KernelTransport.
    TcpFallback {
        addr: IpAddr,
        port: u16,
    },
}

Fence semantics: Each transport variant must implement a common ordering model so the distributed kernel protocol can reason about consistency without knowing the underlying interconnect:

/// Transport-agnostic memory operations.
/// Every NodeTransport variant implements this trait.
pub trait TransportOps {
    /// One-sided remote read. Returns data without interrupting remote CPU.
    /// Semantics: read is atomic for naturally-aligned loads up to 8 bytes.
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError>;

    /// One-sided remote write. Writes data without interrupting remote CPU.
    /// Semantics: write is atomic for naturally-aligned stores up to 8 bytes.
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError>;

    /// Fence: all preceding operations via this transport are visible to the
    /// remote side before any subsequent operation.
    /// RDMA:  For ordering after RDMA Writes, RC QP in-order delivery
    ///        guarantees are sufficient — no explicit fence is needed.
    ///        For ordering after RDMA Reads or Atomics, post a zero-length
    ///        Send with IBV_SEND_FENCE and poll the CQ for completion.
    ///        The fence() implementation issues the appropriate mechanism
    ///        based on the preceding operation type.
    /// PCIe:  SFENCE + read-back from BAR (flush posted writes).
    /// NVLink/CXL (coherent): hardware-coherent, fence is a no-op.
    /// NVLink (non-coherent): GPU membar.sys instruction.
    /// TCP:   implicit (TCP is ordered).
    fn fence(&self) -> Result<(), TransportError>;

    /// Send a message (interrupts remote CPU). For control plane.
    fn send_message(&self, msg: &[u8]) -> Result<(), TransportError>;

    /// Is this transport hardware-coherent? If true, fence() is a no-op
    /// and the DSM directory can skip invalidation for this node pair.
    fn is_coherent(&self) -> bool;
}
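
As a concreteness check, here is a minimal in-process sketch of the TransportOps contract: a hypothetical LoopbackTransport whose "remote" memory is a local buffer, so it is coherent by construction and fence() is a no-op (the same shape as the CXL / NVLink-C2C case). The trait and error type are repeated in unit-variant form so the sketch compiles standalone; this is an illustration of the contract, not the production implementation.

```rust
use std::cell::RefCell;

#[derive(Debug, PartialEq)]
pub enum TransportError { ConnectionLost, Timeout, InvalidAddress, PermissionDenied, DeviceError }

pub trait TransportOps {
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError>;
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError>;
    fn fence(&self) -> Result<(), TransportError>;
    fn send_message(&self, msg: &[u8]) -> Result<(), TransportError>;
    fn is_coherent(&self) -> bool;
}

/// Hypothetical transport whose "remote" memory is a local buffer.
pub struct LoopbackTransport { mem: RefCell<Vec<u8>> }

impl LoopbackTransport {
    pub fn new(size: usize) -> Self { Self { mem: RefCell::new(vec![0; size]) } }
    fn range(&self, addr: u64, len: usize) -> Result<(usize, usize), TransportError> {
        let start = addr as usize;
        let end = start.checked_add(len).ok_or(TransportError::InvalidAddress)?;
        if end > self.mem.borrow().len() { return Err(TransportError::InvalidAddress); }
        Ok((start, end))
    }
}

impl TransportOps for LoopbackTransport {
    fn read(&self, remote_addr: u64, buf: &mut [u8]) -> Result<(), TransportError> {
        let (s, e) = self.range(remote_addr, buf.len())?;
        buf.copy_from_slice(&self.mem.borrow()[s..e]);
        Ok(())
    }
    fn write(&self, remote_addr: u64, data: &[u8]) -> Result<(), TransportError> {
        let (s, e) = self.range(remote_addr, data.len())?;
        self.mem.borrow_mut()[s..e].copy_from_slice(data);
        Ok(())
    }
    // Coherent by construction, so the fence is a no-op (CXL / NVLink-C2C shape).
    fn fence(&self) -> Result<(), TransportError> { Ok(()) }
    // Control-plane delivery is elided in this sketch.
    fn send_message(&self, _msg: &[u8]) -> Result<(), TransportError> { Ok(()) }
    fn is_coherent(&self) -> bool { true }
}

fn main() {
    let t = LoopbackTransport::new(4096);
    t.write(0x100, &[1, 2, 3, 4]).unwrap();
    t.fence().unwrap();                 // no-op on a coherent transport
    let mut buf = [0u8; 4];
    t.read(0x100, &mut buf).unwrap();
    assert_eq!(buf, [1, 2, 3, 4]);
    assert!(t.is_coherent());
    println!("ok");
}
```

A caller written against TransportOps (write, fence, read) behaves identically over RDMA, PCIe BAR, or this loopback; only the cost of fence() differs.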

TransportError Enum:

/// Errors from transport operations (used by NodeTransport trait).
#[repr(u32)]
pub enum TransportError {
    /// Connection to remote node lost (link down, node crashed).
    ConnectionLost  = 0,
    /// Operation timed out (no response within deadline).
    Timeout         = 1,
    /// Invalid remote address (unmapped, out of range).
    InvalidAddress  = 2,
    /// Permission denied (rkey mismatch, capability revoked).
    PermissionDenied = 3,
    /// Device error (NIC failure, PCIe error, CXL protocol error).
    DeviceError     = 4,
}
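
If transport failures must surface through existing Linux syscall returns (per the drop-in compatibility constraint), each variant needs an errno mapping. A hedged sketch follows; the specific pairings are assumptions for illustration, not part of the specification, and the enum is repeated so the sketch compiles standalone.

```rust
/// Repeated from the definition above so the sketch compiles standalone.
#[repr(u32)]
pub enum TransportError {
    ConnectionLost   = 0,
    Timeout          = 1,
    InvalidAddress   = 2,
    PermissionDenied = 3,
    DeviceError      = 4,
}

/// ASSUMED mapping to Linux errno values; the pairings are illustrative.
pub fn to_errno(e: TransportError) -> i32 {
    match e {
        TransportError::ConnectionLost   => 104, // ECONNRESET
        TransportError::Timeout          => 110, // ETIMEDOUT
        TransportError::InvalidAddress   => 14,  // EFAULT
        TransportError::PermissionDenied => 13,  // EACCES
        TransportError::DeviceError      => 5,   // EIO
    }
}

fn main() {
    assert_eq!(to_errno(TransportError::Timeout), 110);
    assert_eq!(to_errno(TransportError::DeviceError), 5);
    println!("ok");
}
```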

NodeTransport Hot-Unplug Handling:

When a device backing a NodeTransport is removed (PCIe hot-unplug, NVLink failure, CXL device removal):

  1. The device registry emits DeviceEvent::Removed for the transport device.
  2. All in-flight operations on that transport return TransportError::ConnectionLost.
  3. The distributed kernel protocol (if active) downgrades the affected node:
     • DSM pages owned by that node become read-only on all other nodes.
     • Capabilities issued by that node are marked suspect (cannot be renewed).
  4. If the device was a GPU running ISLE-lite (Phase 3 peer), its compute units are removed from the unified topology and its memory domains are marked unavailable.
  5. Processes with active contexts on the removed device receive SIGBUS.

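The node-downgrade portion of this sequence (pages read-only, capabilities suspect) can be sketched as a pure state transition over an illustrative flat DSM directory and capability table. ClusterState, PageState, and CapState are hypothetical names for this sketch, not types from the codebase.

```rust
#[derive(Debug, PartialEq)]
enum PageState { ReadWrite, ReadOnly }

#[derive(Debug, PartialEq)]
enum CapState { Valid, Suspect }

/// Illustrative flat directory: (owner_node, state) per DSM page and
/// (issuer_node, state) per capability.
struct ClusterState {
    pages: Vec<(u32, PageState)>,
    caps:  Vec<(u32, CapState)>,
}

impl ClusterState {
    /// Node downgrade after transport loss: pages owned by the lost node
    /// become read-only everywhere; capabilities it issued become suspect.
    fn downgrade_node(&mut self, lost: u32) {
        for (owner, st) in self.pages.iter_mut() {
            if *owner == lost { *st = PageState::ReadOnly; }
        }
        for (issuer, st) in self.caps.iter_mut() {
            if *issuer == lost { *st = CapState::Suspect; }
        }
    }
}

fn main() {
    let mut cs = ClusterState {
        pages: vec![(1, PageState::ReadWrite), (2, PageState::ReadWrite)],
        caps:  vec![(2, CapState::Valid)],
    };
    cs.downgrade_node(2);
    assert_eq!(cs.pages[0].1, PageState::ReadWrite); // other owner unaffected
    assert_eq!(cs.pages[1].1, PageState::ReadOnly);  // lost node's page
    assert_eq!(cs.caps[0].1,  CapState::Suspect);    // lost node's capability
    println!("ok");
}
```
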
Key property: The distributed kernel protocol (DSM coherence, capability exchange, heartbeat, cluster join) is transport-agnostic. It sends messages and reads/writes remote memory. Whether that goes over RDMA, PCIe BAR, NVLink, or CXL is an implementation detail. The protocol layer doesn't change.

This means a GPU running ISLE-lite joins the distributed kernel using the same protocol as a remote server — just over NVLink instead of RDMA. The cluster scheduler, DSM directory, and capability system see it as another node.

54.13.3 Any Accelerator, Any Interconnect

This model is deliberately generic. It applies to any device with a processor, regardless of function:

Device type          Compute capacity profile (§54.3 fields)  When it becomes a peer node
───────────────────  ────────────────────────────────────────  ─────────────────────────────
GPU                  vector=2000, matrix=300, scalar=200      Vendor ships ISLE firmware
NPU                  matrix=40, vector=0, scalar=0            Vendor ships ISLE firmware
DPU/SmartNIC         scalar=5000, vector=100, memory_bw=200   Already runs Linux → easy port
Crypto coprocessor   scalar=1000, vector=0, matrix=0          Vendor ships ISLE firmware
FPGA                 variable (depends on bitstream)          FPGA shell runs ISLE
DSP                  vector=500, scalar=2000, matrix=0        Vendor ships ISLE firmware
CSD (comp. storage)  scalar=3000, memory_bw=500               NVMe controller runs ISLE
Future RISC-V accel  scalar+vector (implementation defined)   Naturally runs ISLE (RISC-V)

All device types express their capacity using the five ComputeCapacityProfile fields defined in §54.3 (scalar, vector, matrix, memory_bw, launch_overhead_us). Specialized workload categories (inference, network offload, crypto, signal processing) map to combinations of these base dimensions. For example, NPU inference throughput is captured by matrix (the dominant operation), and DPU network offload is captured by scalar + memory_bw.
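
The mapping from the table rows to the §54.3 fields can be illustrated with hypothetical constructor functions. The field names follow the text and the numbers are the illustrative ones from the table above; launch_overhead_us is left at its default because the table does not give it.

```rust
/// The five base dimensions from §54.3 (field names per the text).
#[derive(Debug, Default)]
pub struct ComputeCapacityProfile {
    pub scalar: u32,
    pub vector: u32,
    pub matrix: u32,
    pub memory_bw: u32,
    pub launch_overhead_us: u32,
}

/// NPU row: inference throughput lands entirely on the matrix dimension.
pub fn npu_profile() -> ComputeCapacityProfile {
    ComputeCapacityProfile { matrix: 40, ..Default::default() }
}

/// DPU/SmartNIC row: network offload is scalar + memory_bw.
pub fn dpu_profile() -> ComputeCapacityProfile {
    ComputeCapacityProfile { scalar: 5000, vector: 100, memory_bw: 200, ..Default::default() }
}

fn main() {
    let npu = npu_profile();
    assert_eq!((npu.matrix, npu.vector, npu.scalar), (40, 0, 0));
    let dpu = dpu_profile();
    assert_eq!((dpu.scalar, dpu.memory_bw), (5000, 200));
    println!("ok");
}
```
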

RISC-V accelerators are the most natural fit: ISLE has RISC-V as a first-class target architecture (Section 18), so a RISC-V-based accelerator can run the same kernel binary (with a different device tree and minimal board support).

The architecture doesn't need to predict which device types will exist. It only needs to provide:

  1. A generic compute unit model (§54.3, §54.4) — works for any device.
  2. A transport-agnostic distributed kernel protocol (Section 47) — works over any interconnect.
  3. An adoption path that doesn't require vendors to change anything until they choose to.

Mixed-Coherence Cluster Optimization:

In a system with both coherent (CXL, NVLink-C2C) and non-coherent (RDMA, PCIe BAR) transports, the DSM protocol can skip invalidation messages for node pairs connected by coherent transports. The is_coherent() method on TransportOps enables per-pair optimization:

  • Coherent pair (e.g., CPU ↔ CXL GPU): fence() is a no-op. No invalidation messages needed — hardware maintains coherence automatically.
  • Non-coherent pair (e.g., CPU ↔ remote RDMA node): standard DSM invalidation protocol applies (explicit messages + RDMA fencing).
  • Mixed cluster: The DSM directory tracks coherence per node-pair. Invalidation fanout skips coherent pairs, reducing message traffic in heterogeneous clusters.
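
The fanout optimization reduces to a filter over the sharer list. A minimal sketch, assuming each sharer is annotated with whether its pair with the invalidating owner is coherent (the function name is illustrative):

```rust
/// Sharer list annotated with per-pair coherence relative to the
/// invalidating owner: (node_id, is_coherent_pair).
fn invalidation_targets(sharers: &[(u32, bool)]) -> Vec<u32> {
    // Coherent pairs (CXL, NVLink-C2C) need no explicit message: hardware
    // maintains coherence. Only non-coherent sharers get an invalidation.
    sharers.iter().copied().filter(|&(_, coherent)| !coherent).map(|(id, _)| id).collect()
}

fn main() {
    // Nodes 1 and 4 share a coherent link with the owner; 2 and 3 do not.
    let sharers = [(1, true), (2, false), (3, false), (4, true)];
    assert_eq!(invalidation_targets(&sharers), vec![2, 3]);
    println!("ok");
}
```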

54.14 Performance Impact

Systems without accelerators:
  Unified topology contains only CPU entries.
  Overhead: one additional struct per CPU core (~64 bytes).
  Runtime overhead: zero. Advisory system has nothing extra to advise on.

Systems with accelerators (steady state):
  Topology update: ~1μs per accelerator per second (read get_utilization).
  Workload classification: ~2μs per cgroup per second (read perf counters).
  Cross-device energy optimization: ~1μs per cgroup per power budget tick.
  Total: ~4μs per second per cgroup; even aggregated over a thousand
  cgroups, that is a fraction of a percent of one core.

  Same data is already being read by AccelScheduler (Section 42.2.4) and
  PowerBudgetEnforcer (§49). Unified topology reuses those readings.
  Marginal overhead: near zero.

Compute submission hot path: UNCHANGED.
  submit_commands() → AccelScheduler → driver. No new code in this path.

Benefit: better power budgeting decisions (§54.5) save more power than
  the microseconds spent on workload classification.
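
The "fraction of a percent" claim can be sanity-checked by aggregating the per-cgroup cost over an assumed population of 1,000 active cgroups (an illustrative number, not a figure from the text):

```rust
fn main() {
    // ASSUMED example population: 1,000 active cgroups (illustrative).
    let cgroups: u64 = 1_000;
    let us_per_cgroup_per_sec: u64 = 4; // total advisory cost from the estimate above
    let busy_us = cgroups * us_per_cgroup_per_sec; // μs of advisory work per second
    let pct_of_one_core = busy_us as f64 / 1_000_000.0 * 100.0;
    assert_eq!(busy_us, 4_000);
    assert!(pct_of_one_core < 1.0); // well under one percent of one core
    println!("{pct_of_one_core:.1}% of one core");
}
```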