Chapter 23: AI/ML Policy Framework¶
Companion to Chapter 22: AI/ML and Accelerators. This chapter contains §23.1: AI/ML Policy Framework — Closed-Loop Kernel Intelligence. See 22-accelerators.md for §22.1–§22.6 (hardware accelerator framework).
Phase 5c. This chapter specifies the closed-loop kernel intelligence framework: ML models that observe kernel telemetry, predict workload behavior, and adjust scheduling, memory, and I/O policies in real time. All policy decisions are bounded by hard safety invariants — ML suggestions are advisory, never authoritative over safety-critical paths.
23.1 AI/ML Policy Framework: Closed-Loop Kernel Intelligence¶
Section 22.6 defines the inference plumbing: the in-kernel inference engine — a tiny integer-only model running on the hot path (~500–5000 cycles) — and the Tier 2 inference services — more powerful models in userspace. What it does not describe is the policy integration: which kernel knobs ML can actually adjust, how the kernel emits telemetry so ML has data to learn from, and how results from an external "big model" (LLM, large transformer, RL agent) flow back into the kernel to affect scheduler behavior until the next tuning cycle.
This section defines the complete closed-loop framework.
23.1.1 Design Principles¶
Advisory, not authoritative. Every ML-adjusted parameter has a heuristic fallback. If the ML layer crashes, misbehaves, or is absent, the kernel runs its built-in heuristics. ML improves average-case performance; it never replaces correctness.
Bounded parameters. Each tunable parameter has [min, max] bounds enforced at the
kernel's single update path. An ML model cannot set eevdf_weight_scale = 1000000 any more
than a sysctl can — the kernel clamps the value. This keeps the parameter space safe even if
the model is adversarially manipulated.
Temporal decay. Parameters set by ML automatically revert to their defaults after
decay_period_ms milliseconds without a refresh. If the Tier 2 service crashes or
becomes unreachable, the kernel gradually returns to baseline behavior. No explicit
"ML service is down" signal is needed.
Deterministic replay. For bug reproduction, ml_policy=disabled on the kernel
command line disables all Tier 2 observation delivery and parameter updates. All
tunable parameters hold their default values. Two machines with identical hardware
and workload produce identical scheduling/memory/network behavior — the ML layer
introduces no non-determinism.
Workload-specific, not global. ML tuning is cgroup-scoped: the ML layer can set different parameters for a latency-sensitive web server cgroup and a batch analytics cgroup running on the same machine. Global parameters exist but are the minority.
Latency tiers for AI decisions:
| Tier | Latency | Mechanism | Examples |
|---|---|---|---|
| A | < 1 μs | Pure heuristic | Page fault handler, IRQ routing |
| B | 1–50 μs | In-kernel model (Section 22.6) | Page prefetch stride, I/O queue reorder |
| C | 50 μs–5 s | Tier 2 service round-trip | NUMA migration, compression selection, power budget |
| D | 5 s–5 min | Tier 2 + external "big model" | Scheduler workload characterization, full EAS recalibration, anomaly root-cause |
23.1.2 Kernel Observation Bus¶
Every kernel subsystem that participates in ML tuning emits observations via a zero-cost macro. Observations are stored in per-CPU ring buffers and consumed asynchronously by Tier 2 policy services.
// umka-core/src/ml/observation.rs
/// Identifies the emitting subsystem.
///
/// **Note**: Discriminant 0 is intentionally reserved (no subsystem uses it).
/// `ObservationRingSet.rings` is indexed by `subsystem as usize`, making
/// `rings[0]` a permanently null dead slot. This wastes 8 bytes per CPU
/// but avoids off-by-one index arithmetic on every `observe_kernel!` call.
///
/// `Copy`/`PartialEq`/`Debug` are required by `observe_kernel!` (the
/// subsystem expression is used more than once) and by the
/// `debug_assert_eq!` checks in the feature extractors.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(u16)]
pub enum SubsystemId {
Scheduler = 1,
MemoryManager = 2,
TcpStack = 3,
BlockIo = 4,
PowerManager = 5,
FmaHealth = 6,
NvmeDriver = 7,
NetworkDriver = 8,
IoScheduler = 9,
Gpu = 10,
Storage = 11,
Accel = 12,
VfsLayer = 13,
ContainerMgr = 14,
// IDs 15-16 reserved for future subsystems.
// ParamId encoding uses bits [11:8] for the subsystem index (4 bits, max 16
// groups). SubsystemId is repr(u16) for wire compatibility, but ParamId
// limits the active range to 16 subsystems.
//
    // **Overflow plan**: If all 16 groups are consumed and a new subsystem
    // requires ML tuning, increment `PolicyRegisterRequest::version` to v2
    // and widen the subsystem field to bits [11:6] (6 bits, max 64 groups).
    // The ML policy service FD interface is versioned; both sides negotiate
    // the version at registration time. The per-subsystem param field narrows
    // to bits [5:0] (max 64 params per group), which is sufficient (current
    // maximum usage is ~40 params in the Scheduler group). Alternatively,
    // merge low-utilization groups (e.g., FmaHealth and PowerManager share
    // similar telemetry patterns).
}
impl SubsystemId {
/// Derive the subsystem from a `ParamId` discriminant value.
/// ParamId encoding: bits [11:8] = subsystem index (4 bits, max 16 groups), bits [7:0] = param
/// within subsystem. E.g., ParamId 0x0001 = Scheduler param 1,
/// 0x0102 = MemoryManager param 2, 0x0200 = TcpStack param 0.
/// Returns `None` for out-of-range group indices. Callers must handle
/// the None case (dispatch_to_subsystem returns PolicyError::InvalidParam;
/// the registration path rejects the param).
pub const fn from_param_id(id: ParamId) -> Option<Self> {
// Extract upper byte and map to SubsystemId discriminant.
// ParamId discriminants use 0x00xx for Scheduler (SubsystemId=1),
// 0x01xx for MemoryManager (SubsystemId=2), etc.
// The +1 accounts for SubsystemId discriminants starting at 1, not 0.
// ParamId group byte 0x00 → SubsystemId 1 (Scheduler),
// group byte 0x01 → SubsystemId 2 (MemoryManager), etc.
let group = ((id as u32 >> 8) & 0x0F) as u16 + 1;
match group {
1 => Some(SubsystemId::Scheduler),
2 => Some(SubsystemId::MemoryManager),
3 => Some(SubsystemId::TcpStack),
4 => Some(SubsystemId::BlockIo),
5 => Some(SubsystemId::PowerManager),
6 => Some(SubsystemId::FmaHealth),
7 => Some(SubsystemId::NvmeDriver),
8 => Some(SubsystemId::NetworkDriver),
9 => Some(SubsystemId::IoScheduler),
10 => Some(SubsystemId::Gpu),
11 => Some(SubsystemId::Storage),
12 => Some(SubsystemId::Accel),
13 => Some(SubsystemId::VfsLayer),
14 => Some(SubsystemId::ContainerMgr),
// Groups 15-16 reserved for future subsystems.
_ => None,
}
}
}
/// Compact observation emitted by a kernel subsystem.
/// 64 bytes total — fits in one cache line. align(64) enforces
/// cache-line alignment for per-CPU ring buffer allocations.
#[repr(C, align(64))]
pub struct KernelObservation {
pub timestamp_ns: u64, // Monotonic clock (TSC-derived)
pub subsystem: SubsystemId, // Source subsystem
pub obs_type: u16, // Subsystem-defined event type (see tables below)
pub cpu_id: u16, // CPU where the event occurred
pub _pad: u16,
pub cgroup_id: u64, // Originating cgroup (0 = kernel/no cgroup)
pub features: [i32; 10], // Up to 10 integer feature dimensions.
// **Truncation warning**: values wider than 32 bits (e.g.,
// nanosecond timestamps, byte counts) must be scaled before
// storing. Use `(value >> SHIFT) as i32` or `(value / SCALE) as i32`
// to fit in 32 bits. The 64-byte cache-line constraint prevents
// using i64 (would reduce to 5 features). Subsystems needing
// wide values should use relative deltas or per-unit scaling.
// Slot assignments are subsystem-specific and defined by
// `define_feature_extractor!` declarations below. Each
// `obs_type` has its own feature layout — see the per-subsystem
// observation type tables (Scheduler §6, Memory §4, TCP, Block I/O,
// etc.) for the authoritative slot-to-metric mapping. Unused
// trailing slots are zero-filled by `collect_features!`.
}
const_assert!(core::mem::size_of::<KernelObservation>() == 64);
/// Per-CPU observation ring buffer.
/// Lock-free SPSC: kernel writes (producer), Tier 2 service reads (consumer).
/// Consumer attaches exactly one reader handle per subsystem per CPU.
/// **Allocation**: ~256 KB per instance. MUST be allocated via `vmalloc` or
/// `Box::<ObservationRing>::new_zeroed()` followed by `assume_init()` (all-zero
/// bytes are a valid value for every field, and the allocation goes straight
/// to the heap with no stack construction). Stack-allocating this struct would
/// overflow the 8-16 KiB kernel stack. `Box::new(ObservationRing { ... })`
/// is INCORRECT — it constructs on the stack first, then moves to heap.
pub struct ObservationRing {
pub buf: [KernelObservation; 4096], // ~256 KB per CPU per subsystem
/// Write pointer (kernel). **Longevity**: u64 wraps after ~18.4 quintillion
/// observations. At 10M observations/sec per CPU, wrap occurs after ~58,000
/// years. Wrapping arithmetic is correct: `head - tail` computes pending
/// entries via unsigned modular subtraction regardless of wrap.
pub head: AtomicU64,
/// Read pointer (Tier 2 service). Same u64 longevity as `head`.
pub tail: AtomicU64,
/// Dropped observation count (diagnostic, not used for ring protocol).
pub overflow: AtomicU64,
}
The observe_kernel! macro emits observations with zero overhead when no consumer
is registered (one byte static key, branch predicted-not-taken). It depends on the
collect_features! helper macro to pack feature arguments into the fixed-size
[i32; 10] features array:
/// Packs up to 10 feature expressions into a `[i32; 10]` array for use in
/// `KernelObservation::features`. Accepts 1–10 positional expressions; unused
/// slots are zero-filled. All expressions must be convertible to `i32` via
/// `as`; each expression is evaluated exactly once.
///
/// **Zero runtime cost**: the expansion is a straight-line array fill that
/// the compiler unrolls — no `Option`, no branching, and the width-check
/// helper is an inlined no-op.
///
/// # Example
/// ```
/// let arr = collect_features!(latency_ns as i32, runqueue_len, cpu_id as i32);
/// // Equivalent to: [latency_ns as i32, runqueue_len as i32, cpu_id as i32, 0, 0, 0, 0, 0, 0, 0]
/// ```
macro_rules! collect_features {
    // Internal: one `()` per feature, used for compile-time counting.
    (@unit $e:expr) => { () };
    // Entry: accept 1–10 comma-separated expressions.
    ($($feat:expr),+ $(,)?) => {{
        // Count provided features at compile time.
        const _N: usize = <[()]>::len(&[$(collect_features!(@unit $feat)),+]);
        const _: () = assert!(_N <= 10, "collect_features! accepts at most 10 features");
        // Build the array: provided values first; unused trailing slots keep
        // their 0i32 initializer. Values wider than 32 bits (u64, usize) MUST
        // be pre-scaled before passing — use `(value >> SHIFT) as i32` or
        // `(value / SCALE) as i32`; the width check below rejects them.
        let mut arr = [0i32; 10];
        let mut _i = 0;
        $(
            // Evaluate once, then check and cast.
            let __f = $feat;
            __assert_fits_i32(&__f); // post-monomorphization width check
            arr[_i] = __f as i32;
            _i += 1;
        )+
        arr
    }};
}

/// Width-check helper for `collect_features!`. A runtime local cannot appear
/// in a `const` item, so the naive `const _: () = assert!(size_of_val(&$feat) <= 4)`
/// does not compile inside the macro. Instead, referencing this associated
/// const forces a post-monomorphization evaluation of the size assertion,
/// failing the build for any feature type wider than `i32`.
pub struct FitsI32<T>(core::marker::PhantomData<T>);
impl<T> FitsI32<T> {
    pub const OK: () = assert!(
        core::mem::size_of::<T>() <= 4,
        "collect_features!: feature wider than i32 — pre-scale before passing"
    );
}
#[inline(always)]
pub fn __assert_fits_i32<T>(_v: &T) {
    let _: () = FitsI32::<T>::OK;
}
Implementation note: the expansion shown fills a `[0i32; 10]` array in place rather than emitting an array literal; an equivalent formulation uses a counting-macro pattern that emits the exact number of `, 0` zero-fillers. Neither form involves `Option` or branching. Either way the result is always a `[i32; 10]` with no abstraction cost: the macro expands at compile time, and the populated feature values are computed at the observation call site (runtime, same cost as direct array writes).
23.1.2.1 Typed Feature Extraction: define_feature_extractor!¶
While collect_features! packs ad-hoc expressions into the [i32; 10] observation
array at emit time, the Tier 2 policy service needs the inverse: a typed extractor
that unpacks the [i32; 10] array back into named f32 fields for inference input.
The define_feature_extractor! macro generates a per-subsystem collect() function
that converts a KernelObservation into a fixed-size [f32; N] feature vector with
named accessors.
/// Define a typed feature extractor for a subsystem's observation data.
///
/// Generates:
/// - A `struct {Name}Features` with named `f32` fields.
/// - A `fn collect(obs: &KernelObservation) -> [f32; N]` that extracts and
/// converts the `[i32; 10]` features array into `[f32; N]`.
/// - A `const FEATURE_NAMES: [&str; N]` for debug/logging.
///
/// The generated code is zero-allocation (stack-only) and suitable for use
/// in Tier 2 userspace inference pipelines.
///
/// # Example
/// ```
/// define_feature_extractor! {
/// /// Scheduler feature extractor for TaskWoke observations.
/// SchedWakeFeatures for SubsystemId::Scheduler {
/// latency_ns: [0], // features[0] → latency in nanoseconds
/// runqueue_len: [1], // features[1] → runqueue length at wakeup
/// prev_cpu: [2], // features[2] → CPU the task last ran on
/// target_cpu: [3], // features[3] → CPU selected for wakeup
/// }
/// }
///
/// // Generated API:
/// // struct SchedWakeFeatures { pub latency_ns: f32, pub runqueue_len: f32, ... }
/// // impl SchedWakeFeatures {
/// // pub fn collect(obs: &KernelObservation) -> [f32; 4] { ... }
/// // pub const FEATURE_NAMES: [&str; 4] = ["latency_ns", "runqueue_len", ...];
/// // pub fn from_obs(obs: &KernelObservation) -> Self { ... }
/// // }
/// ```
macro_rules! define_feature_extractor {
(
$(#[$meta:meta])*
$name:ident for $subsystem:expr {
$( $field:ident : [$idx:expr] ),+ $(,)?
}
) => {
$(#[$meta])*
pub struct $name {
$( pub $field: f32, )+
}
impl $name {
/// Number of features extracted by this extractor.
pub const N: usize = {
// Count fields at compile time.
let mut _n = 0usize;
$( let _ = stringify!($field); _n += 1; )+
_n
};
/// Human-readable names for each feature dimension.
/// Useful for logging, model metadata, and ONNX input naming.
pub const FEATURE_NAMES: [&'static str; Self::N] = [
$( stringify!($field), )+
];
/// The subsystem this extractor is designed for.
/// Used for debug assertions in the inference pipeline.
pub const SUBSYSTEM: SubsystemId = $subsystem;
/// Extract features from a `KernelObservation` into a fixed-size
/// `[f32; N]` array suitable for direct use as model input.
///
/// Each `features[idx]` is cast from `i32` to `f32`. The conversion is
/// exact for values in `[-2^24, 2^24]` (±16M) — queue lengths, CPU IDs,
/// and scaled counters all fall in this range. Larger values (e.g.,
/// nanosecond latencies up to `i32::MAX` ≈ 2.1 s) round to f32's 24
/// significand bits, a relative error below 6e-8 — negligible as model input.
///
/// # Panics (debug only)
/// Debug-asserts that `obs.subsystem == Self::SUBSYSTEM`.
#[inline]
pub fn collect(obs: &KernelObservation) -> [f32; Self::N] {
debug_assert_eq!(
obs.subsystem, Self::SUBSYSTEM,
"feature extractor subsystem mismatch"
);
[ $( obs.features[$idx] as f32, )+ ]
}
/// Extract features into a named struct for ergonomic access.
#[inline]
pub fn from_obs(obs: &KernelObservation) -> Self {
debug_assert_eq!(
obs.subsystem, Self::SUBSYSTEM,
"feature extractor subsystem mismatch"
);
Self {
$( $field: obs.features[$idx] as f32, )+
}
}
}
};
}
// Concrete extractors for each subsystem's primary observation type.
define_feature_extractor! {
/// Scheduler wakeup features (SchedObs::TaskWoke).
SchedWakeFeatures for SubsystemId::Scheduler {
latency_ns: [0],
runqueue_len: [1],
prev_cpu: [2],
target_cpu: [3],
}
}
define_feature_extractor! {
/// Memory page fault features (MemObs::PageFault).
MemFaultFeatures for SubsystemId::MemoryManager {
fault_addr_page: [0],
fault_flags: [1],
alloc_latency: [2],
numa_node: [3],
lru_gen: [4],
}
}
define_feature_extractor! {
/// OOM victim selection features (MemObs::OomVictimSelection).
/// Emitted when the OOM killer is invoked, before victim selection.
OomVictimFeatures for SubsystemId::MemoryManager {
constraint_type: [0], // OomConstraint discriminant
candidate_count: [1], // Number of eligible tasks
victim_pid: [2], // Selected victim PID
victim_rss: [3], // Victim RSS in pages
victim_swap: [4], // Victim swap in pages
victim_score: [5], // Computed OOM score (0-2000)
victim_cgroup_id: [6], // Victim cgroup ID
free_pages: [7], // System free pages at OOM time
psi_stall_us: [8], // PSI memory stall last 10s (µs)
}
}
define_feature_extractor! {
/// OOM outcome features (MemObs::OomOutcome).
/// Emitted 5 seconds after OOM kill to record actual recovery metrics.
/// This is the ground-truth signal for closed-loop ML training.
OomOutcomeFeatures for SubsystemId::MemoryManager {
victim_pid: [0], // PID of killed task
pages_freed: [1], // Pages actually reclaimed
time_to_recovery_ms: [2], // ms until PSI stall < 1%
service_restart_ms: [3], // ms until replacement task appeared
cascading_kills: [4], // Additional OOM kills within 10s
oom_count_after: [5], // memory.events.oom 10s after
was_ml_adjusted: [6], // 1 if ML adjusted score, 0 if pure heuristic
baseline_score: [7], // Score without ML adjustment
}
}
define_feature_extractor! {
/// TCP stack features (TcpObs::SegmentArrival).
TcpSegmentFeatures for SubsystemId::TcpStack {
rtt_us: [0],
cwnd_segs: [1],
ssthresh_segs: [2],
in_flight: [3],
loss_count: [4],
}
}
define_feature_extractor! {
/// Block I/O features (BlockObs::RequestComplete).
BlockIoFeatures for SubsystemId::BlockIo {
latency_us: [0],
queue_depth: [1],
sector_count: [2],
is_write: [3],
}
}
With features packed by collect_features!, the observe_kernel! macro itself can be specified. Its zero-overhead disabled path (one byte static key, branch predicted-not-taken) relies on runtime code patching:
Static key patching mechanism: Each observe_kernel! call site contains a NOP
instruction at compile time. When a Tier 2 policy service registers (or deregisters)
for a subsystem, the runtime patcher writes to OBSERVE_ENABLED[subsystem] and
rewrites each NOP to a short conditional jump (JMP) — or back to NOP. This is a
standard static branch / jump-label technique: the patching cost is paid once at
registration time, and every subsequent call-site check is a single correctly-predicted
branch with zero cache-miss overhead.
The static_key_enabled!(OBSERVE_ENABLED[subsystem]) check expands to the patched
branch instruction. Per-architecture patching details:
| Architecture | Disabled (NOP) | Enabled (branch) | Instruction size | Patching mechanism |
|---|---|---|---|---|
| x86-64 | `0x90` (NOP) or `0x0F 0x1F 0x00` (3-byte NOP) | `JMP rel8` or `JNE rel8` | 2-5 bytes | `text_poke_bp()`: INT3 breakpoint → write → remove INT3 |
| AArch64 | `NOP` (`0xD503201F`) | `B <offset>` (`0x14xxxxxx`) | 4 bytes | IPI all CPUs → `dc cvau` + `ic ivau` → `dsb ish; isb` |
| ARMv7 | `NOP` (`0xE320F000`) | `B <offset>` (`0xEAxxxxxx`) | 4 bytes | `stop_machine()` → write → flush I-cache (`mcr p15, 0, r0, c7, c5, 0`) |
| RISC-V | `NOP` (`0x00000013`) | `JAL x0, <offset>` (J-type) | 4 bytes | IPI `fence.i` on all harts (RISC-V has no coherent I-cache guarantee) |
| PPC32 | `nop` (`0x60000000`) | `b <offset>` (`0x48xxxxxx`) | 4 bytes | `stop_machine()` → patch → `isync` on each CPU |
| PPC64LE | `nop` (`0x60000000`) | `b <offset>` (`0x48xxxxxx`) | 4 bytes | Same as PPC32 (`stop_machine` + `isync`) |
| s390x | `bc 0,0` (NOP, `0x47000000`) | `brc <mask>, <offset>` (`0xA7x4xxxx`) | 4-6 bytes | IPI all CPUs via SIGP → patch → `bcr 14,0` (serializing) on each CPU |
| LoongArch64 | `andi $zero,$zero,0` (NOP) | `b <offset>` | 4 bytes | IPI all CPUs → patch → `ibar 0` (instruction barrier) on each CPU |
The static key facility itself is defined earlier in this chapter. The ML policy
subsystem uses it via the generic static_key_enable()/static_key_disable()
API; the per-arch patching is handled by arch::current::text_patch.
/// Emit a kernel observation.
/// Overhead when disabled: 1–3 cycles (static branch miss rate ~0%).
/// Overhead when enabled: ~10–30 cycles (TSC read + ring buffer write).
///
/// # Example
/// ```
/// observe_kernel!(SubsystemId::Scheduler, SchedObs::TaskWoke,
/// cgroup_id, latency_ns as i32, runqueue_len, prev_cpu);
/// ```
macro_rules! observe_kernel {
    ($subsystem:expr, $obs_type:expr, $cgroup:expr, $($feat:expr),+ $(,)?) => {{
        // Static key: a single .byte 0x90 (NOP) while no consumer is
        // registered; the runtime patcher rewrites it to a JMP when a
        // Tier 2 service registers for this subsystem.
        if static_key_enabled!(OBSERVE_ENABLED[$subsystem as usize]) {
            // Preemption must be disabled while accessing the per-CPU ring.
            // The macro acquires a preempt_disable guard to ensure the CPU ID
            // read and the ring push are on the same CPU. Without this guard,
            // a preemption between CpuLocal::cpu_id() and push_current_cpu()
            // could push to the wrong CPU's ring (correctness issue: the ring
            // is SPSC, with this CPU as sole producer and the Tier 2 service's
            // polling thread as sole consumer).
            let __preempt_guard = preempt_disable();
            let __obs = KernelObservation {
                timestamp_ns: crate::arch::current::cpu::read_cycle_counter_ns(),
                subsystem: $subsystem,
                obs_type: $obs_type as u16,
                cpu_id: CpuLocal::cpu_id() as u16,
                _pad: 0,
                cgroup_id: $cgroup,
                features: collect_features!($($feat),+),
            };
            ObservationRing::push_current_cpu($subsystem, __obs);
            drop(__preempt_guard);
        }
    }};
}
ObservationRing::push_current_cpu specification:
impl ObservationRing {
/// Push an observation to the current CPU's ring for `subsystem`.
/// Called from the observe_kernel! macro on the hot path.
///
/// - Accesses the per-CPU `ObservationRingSet` via `CpuLocal::get()` (no
/// lock, pinned to CPU by the caller's preempt-disable context from the
/// static key check).
/// - The ring is a fixed-capacity circular buffer (`capacity` entries,
/// power-of-two, default 4096 per subsystem per CPU). Head is the write
/// index (producer), tail is the read index (consumer).
/// - **Overflow policy**: overwrite oldest. `head` advances unconditionally
/// with `Release` ordering. If `head - tail >= capacity`, the oldest entry
/// is silently lost and `ObservationRingSet::dropped` is incremented
/// (Relaxed). The consumer detects loss via the `dropped` counter.
/// - No lock: single producer (this CPU), single consumer (the Tier 2
/// service's polling thread). SPSC semantics — head written by producer
/// with Release, read by consumer with Acquire; tail written by consumer,
/// read by producer.
    pub fn push_current_cpu(subsystem: SubsystemId, obs: KernelObservation) {
        let ring_set = CpuLocal::get::<ObservationRingSet>();
        let ring_ptr = ring_set.rings[subsystem as usize].load(Ordering::Acquire);
        if ring_ptr.is_null() {
            return; // Category not active — no ring allocated.
        }
        // SAFETY: non-null pointer was stored by registration path with Release.
        let ring = unsafe { &*ring_ptr };
        let head = ring.head.load(Relaxed);
        let tail = ring.tail.load(Relaxed);
        // Overflow check BEFORE the write: if the ring is already full, this
        // write overwrites the oldest unread entry. (Checking after the head
        // increment with `>=` would falsely count the ring merely becoming
        // full as a drop — an off-by-one.)
        if head.wrapping_sub(tail) >= ring.buf.len() as u64 {
            ring_set.dropped.fetch_add(1, Relaxed);
        }
        let idx = (head & (ring.buf.len() as u64 - 1)) as usize;
        // Write the observation into its slot. This CPU is the sole writer;
        // the real `buf` field is interiorly mutable (UnsafeCell, elided
        // above for readability), hence the raw-pointer write.
        unsafe {
            core::ptr::addr_of!(ring.buf[idx]).cast_mut().write(obs);
        }
        // Publish: consumer sees the new entry after this Release store.
        ring.head.store(head.wrapping_add(1), Release);
    }
}
/// Consumer (Tier 2 service) read protocol:
///
/// **Torn-read protection**: Because KernelObservation is 64 bytes (not
/// atomically writable), the producer's overwrite-oldest policy can cause
/// the consumer to read partially-overwritten data. The consumer uses a
/// head-snapshot validation pattern (same as Linux `perf_output_put_handle`):
///
/// ```pseudo
/// fn read_observations(ring: &ObservationRing, buf: &mut [KernelObservation]) -> usize {
/// let tail = ring.tail.load(Relaxed); // local read pointer
/// let head = ring.head.load(Acquire); // kernel write pointer
/// let available = head.wrapping_sub(tail);
/// let to_read = available.min(buf.len() as u64);
/// let mut count = 0;
///     for i in 0..to_read {
///         let idx = tail.wrapping_add(i) & (ring.buf.len() as u64 - 1);
///         buf[count] = ring.buf[idx].read();
///         // fence(Acquire) — ensure the data read completes before re-reading head.
///         let head_after = ring.head.load(Acquire);
///         // Validate: entry `tail + i` lives in slot `idx` until the producer
///         // writes entry `tail + i + capacity` into the same slot, which it
///         // may already be doing once head reaches `tail + i + capacity`.
///         // The copy is therefore valid iff head has not advanced that far.
///         if head_after.wrapping_sub(tail.wrapping_add(i)) < ring.buf.len() as u64 {
///             count += 1; // valid read
///         }
///         // else: torn observation, skip silently
///     }
/// ring.tail.store(tail.wrapping_add(to_read), Release);
/// count
/// }
/// ```
/// Threshold for triggering an FMA health alert on excessive observation drops.
/// Measured as drops per second per ring (evaluated by the periodic health check
/// timer, not inline on the hot path).
pub const OBSERVATION_DROP_ALERT_THRESHOLD: u64 = 1000;
When dropped exceeds OBSERVATION_DROP_ALERT_THRESHOLD (1000 per second per ring),
an FMA health event is raised with HealthEventClass::Generic. The policy
service can respond by reducing observation frequency via the ObservationThrottleMsg
control message.
Aggregation. Tier 2 services typically consume raw observations and aggregate them into feature vectors over configurable windows (100ms, 1s, 10s, 60s). The ring buffer provides raw data; aggregation policy is entirely in Tier 2 userspace.
23.1.3 Tunable Parameter Store¶
Every kernel subsystem that accepts ML-driven tuning registers its parameters in a
global KernelParamStore. Parameters are read by the kernel with a single atomic load
(~1–3 cycles); writes require a CAS plus a version increment.
// umka-core/src/ml/params.rs
/// A single tunable kernel parameter.
/// Layout: 128 bytes (two cache lines; align(64) rounds the size up, and the
/// explicit padding keeps every atomic field naturally aligned).
///
/// Byte layout:
/// param_id(4) + subsystem(2) + param_name(24) + _pad0(2) +
/// current(8) + default_value(8) + min_value(8) + max_value(8) +
/// decay_period_ms(4) + _pad1(4) + version(8) + last_updated_ns(8) +
/// _pad2(40) = 128 bytes.
#[repr(C, align(64))]
pub struct KernelTunableParam {
/// Unique monotonic ID assigned at registration time.
/// u32 matches wire protocol format; callers use ParamId enum for type safety.
pub param_id: u32,
pub subsystem: SubsystemId,
pub param_name: [u8; 24], // Null-terminated ASCII name
/// Explicit padding: [u8; 24] ends at offset 30; AtomicI64 needs
/// 8-byte alignment at offset 32. Must be zeroed.
pub _pad0: [u8; 2],
/// Current value (ML-adjusted). Read with AtomicI64::load(Relaxed).
pub current: AtomicI64,
pub default_value: i64,
pub min_value: i64,
pub max_value: i64,
/// After this many ms without a refresh from the ML layer, the parameter
/// automatically decays to `default_value`. 0 = no decay (permanent until reset).
pub decay_period_ms: u32,
/// Explicit padding: decay_period_ms ends at offset 68; AtomicU64 needs
/// 8-byte alignment at offset 72. Must be zeroed.
pub _pad1: [u8; 4],
/// Monotonic version, incremented on every successful update.
/// Consumers can detect parameter staleness by watching for version changes.
pub version: AtomicU64,
/// Timestamp of last ML-driven update (ns). 0 = never updated.
pub last_updated_ns: AtomicU64,
/// Trailing padding to fill the 128-byte align(64) boundary.
/// Fields end at offset 88; align(64) rounds to 128. Must be zeroed.
pub _pad2: [u8; 40],
}
const_assert!(core::mem::size_of::<KernelTunableParam>() == 128);
/// Maximum number of subsystem groups (ParamId bits [11:8]).
/// Group indices map to `SubsystemId` discriminants via `(group + 1)`:
/// 0=Scheduler, 1=MemoryManager, 2=TcpStack, 3=BlockIo, 4=PowerManager,
/// 5=FmaHealth, 6=NvmeDriver, 7=NetworkDriver, 8=IoScheduler, 9=Gpu,
/// 10=Storage, 11=Accel, 12=VfsLayer, 13=ContainerMgr, 14-15=Reserved.
/// These names match the `SubsystemId` enum variants exactly.
pub const PARAM_GROUP_COUNT: usize = 16;
/// Maximum number of parameters per group (ParamId bits [7:0]).
pub const PARAMS_PER_GROUP: usize = 256;
/// Global parameter registry. Two-level indexed by ParamId:
/// group = (param_id >> 8) & 0x0F (4 bits, max 16 groups)
/// index = param_id & 0xFF (max 256 per group)
/// Total capacity: 16 * 256 = 4096 slots (512 KiB params + 4 KiB registered flags).
/// O(1) lookup, no heap allocation, cache-friendly per-group access.
/// Initialized at boot from per-subsystem `register_param!` calls.
/// Note: BSS (zero-initialized, no physical pages consumed until written).
/// On memory-constrained targets, `PARAMS_PER_GROUP` can be reduced to 64
/// (128 KiB total) via compile-time configuration. The ML policy subsystem
/// is Phase 4+ and targets server workloads.
///
/// **`register_param!` macro** — registers a kernel tunable parameter at boot time.
/// Takes an explicit `ParamId` variant (the enum has fixed discriminants assigned
/// per-subsystem, so dynamic ID allocation is neither needed nor desired):
///
/// ```rust
/// /// Register a kernel tunable parameter. Expands to a KernelParamStore::register()
/// /// call at boot time. The caller supplies an explicit ParamId variant — IDs are
/// /// statically assigned in the `ParamId` enum defined below.
/// macro_rules! register_param {
/// ($id:expr, $name:literal, $default:expr, $min:expr, $max:expr, $decay_ms:expr) => {
/// KERNEL_PARAM_STORE.register(KernelTunableParam {
/// param_id: $id as u32,
/// subsystem: SubsystemId::from_param_id($id)
/// .expect("ParamId has no valid SubsystemId mapping"),
/// param_name: ascii_pad!($name),
/// _pad0: [0; 2],
/// current: AtomicI64::new($default),
/// default_value: $default,
/// min_value: $min,
/// max_value: $max,
/// decay_period_ms: $decay_ms,
/// _pad1: [0; 4],
/// version: AtomicU64::new(0),
/// // Initialize to current boot timestamp to prevent immediate
/// // decay expiry. With 0, `now_ns - 0 > decay_period_ms * 1e6`
/// // would be true on the first decay check.
/// last_updated_ns: AtomicU64::new(ktime_get_ns()),
/// _pad2: [0; 40],
/// })
/// };
/// }
///
/// // Usage example:
/// // register_param!(ParamId::SchedEevdfWeightScale, "eevdf_weight_scale", 100, 10, 1000, 5000);
/// ```
///
/// `SubsystemId::from_param_id()` derives the subsystem from the ParamId discriminant
/// (group byte: 0x00xx = Scheduler, 0x01xx = MemoryManager, 0x02xx = TcpStack, etc.).
pub struct KernelParamStore {
    /// MaybeUninit lets slots stay uninitialized until boot-time registration
    /// while preserving the exact 128-byte stride and alignment of
    /// KernelTunableParam; `Option<KernelTunableParam>` would add a
    /// discriminant and break the repr(C)/align(64) layout assumptions.
pub params: [[MaybeUninit<KernelTunableParam>; PARAMS_PER_GROUP]; PARAM_GROUP_COUNT],
    /// Per-slot initialization flags: one `AtomicBool` per slot (a byte-map,
    /// not a packed bitmap), with Acquire/Release ordering because `get()`
/// reads this array lock-free (without `store_lock`) while
/// `register()` writes it under `store_lock`. Without atomics,
/// a concurrent `get()` racing with `register()` could see a
/// torn `true` value before the corresponding `params` slot is
/// fully initialized — reading uninitialized memory.
///
/// Protocol:
/// - Writer (register): initialize `params[g][i]` first, then
/// `registered[g][i].store(true, Release)`. The Release ensures
/// all writes to `params` are visible before the flag is set.
/// - Reader (get): `registered[g][i].load(Acquire)`. The Acquire
/// ensures all writes by the writer are visible after the flag
/// reads `true`.
pub registered: [[AtomicBool; PARAMS_PER_GROUP]; PARAM_GROUP_COUNT],
pub count: AtomicU32,
pub store_lock: SpinLock<()>, // Protects registration; not held during reads
}
impl KernelParamStore {
/// Look up a parameter by its `ParamId`. O(1), lock-free.
#[inline]
pub fn get(&self, id: ParamId) -> Option<&KernelTunableParam> {
let raw = id as u32;
let group = ((raw >> 8) & 0x0F) as usize;
let index = (raw & 0xFF) as usize;
if self.registered[group][index].load(Acquire) {
// SAFETY: registered[group][index] is only set to true (via
// Release store) after the corresponding params[group][index]
// has been fully initialized. The Acquire load above pairs
// with the Release store in register(), ensuring all writes
// to params[group][index] are visible.
Some(unsafe { self.params[group][index].assume_init_ref() })
} else {
None
}
}
/// Iterate all registered (initialized) parameters across all groups.
pub fn active_params(&self) -> impl Iterator<Item = &KernelTunableParam> {
self.params.iter().zip(self.registered.iter())
.flat_map(|(pg, rg)| pg.iter().zip(rg.iter()))
.filter_map(|(slot, reg)| {
if reg.load(Acquire) {
// SAFETY: registered flag (Acquire) guarantees the slot is initialized.
Some(unsafe { slot.assume_init_ref() })
} else {
None
}
})
}
}
Reading a parameter (near-zero overhead compared to a hardcoded value):
// Hot-path read pattern — no lock, single atomic load:
let weight_scale = PARAM_STORE.get(ParamId::SchedEevdfWeightScale)
.map_or(100, |p| p.current.load(Relaxed));
Parameter consumption points:
| Parameter | Consumer | Read Path | Frequency |
|---|---|---|---|
| `sched_slice_ns` | Scheduler (`pick_eevdf`) | `CpuLocal` cached copy | Every scheduling tick |
| `mem_reclaim_aggression` | kswapd reclaim scan | Direct read from `KernelParamStore` | Every reclaim cycle |
| `io_congestion_threshold` | Block layer congestion check | Cached in `RequestQueue` | Every bio submission |
Parameters are published via KernelParamStore::update() which uses atomic
stores. Consumers that cache values refresh on a configurable interval
(default: 100ms) or on explicit policy push notification.
Scope limitation: The CpuLocal cached copies are system-wide parameter values,
NOT per-cgroup overrides. When a task runs in a cgroup with ML policy overrides
(MlPolicyCss::overrides), the hot-path consumer reads the global CpuLocal cache
first (fast path: ~1 cycle), then checks for a per-cgroup override via
current_task().css_set.ml_css.overrides.lookup(param_id) (warm path: XArray lookup,
~20-50 cycles). The per-cgroup override takes precedence when present. This two-level
lookup ensures the common case (no override) pays only the CpuLocal read cost.
Decay enforcement runs once per second (not per tick). The scheduler tick checks a
decay_due flag that is set by a 1-second periodic timer. Per-cgroup decay runs on
a designated decay CPU to avoid multi-CPU races on the cgroup tree walk.
/// The CPU designated to run per-cgroup parameter decay.
/// Initialized to 0 at boot. Reassigned via CPU hotplug notifier when the
/// current decay CPU goes offline:
/// cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "ml-policy/decay",
/// decay_cpu_online, decay_cpu_offline)
/// On offline: `DECAY_CPU.store(cpumask_any_online(), Relaxed)`.
/// On online: no action (keep current decay CPU).
static DECAY_CPU: AtomicU32 = AtomicU32::new(0);
// Called from schedule_tick() — guarded by decay_due flag (1 Hz periodic timer).
// Phase 1 may run on any CPU that observes decay_due; Phase 2 runs only on
// the designated decay CPU (checked below).
// O(active_params) with early-exit on no-expiry.
fn enforce_param_decay(now_ns: u64) {
// Phase 1: Global parameter store decay
for param in PARAM_STORE.active_params() {
if param.decay_period_ms > 0 {
let last = param.last_updated_ns.load(Acquire);
// saturating_sub prevents wrap to very large u64 if last > now_ns
// (possible on multi-socket systems with TSC skew across packages).
if now_ns.saturating_sub(last) > param.decay_period_ms as u64 * 1_000_000 {
param.current.store(param.default_value, Release);
param.version.fetch_add(1, Release);
}
}
}
// Phase 2: Per-cgroup override decay (MlPolicyCss::overrides)
    // Runs only on the designated decay CPU (single writer avoids contention on the cgroup tree walk).
// **Cost**: O(N) where N = total cgroups with ML overrides. At 1 Hz, this is
// acceptable for up to ~10,000 cgroups (walk takes <1ms at ~100ns/cgroup).
// For larger deployments, a future optimization could use a "dirty list" of
// cgroups with pending decay (linked at override write time), reducing the
// walk to O(dirty) instead of O(all).
if CpuLocal::cpu_id() == DECAY_CPU.load(Relaxed) {
enforce_cgroup_override_decay(now_ns);
}
}
/// Walk the cgroup tree and expire stale per-cgroup parameter overrides.
/// Each MlPolicyCss holds an ArrayVec of ParamOverride entries (param_id, value, expiry_ns).
/// Expired entries are removed by swapping with the last element (O(1) remove).
fn enforce_cgroup_override_decay(now_ns: u64) {
for css in css_for_each_descendant(&ML_POLICY_ROOT_CSS) {
let ml_css = css.subsys_state::<MlPolicyCss>();
// overrides is RcuCell<ArrayVec<ParamOverride, CAP_CGROUP_ML_PARAMS>>.
// RcuCell provides .read() and .update(), NOT .lock().
let current = ml_css.overrides.read();
if current.iter().any(|e| e.expiry_ns <= now_ns) {
// Clone, filter expired, publish via RCU copy-on-write.
let mut new = current.clone();
new.retain(|entry| entry.expiry_ns > now_ns);
ml_css.overrides.update(new);
}
}
}
23.1.4 Policy Consumer KABI (Tier 2 → Kernel)¶
Before the vtable is defined, the two supporting types passed during registration are specified:
// umka-core/src/ml/observation.rs (continued)
/// Per-CPU set of observation ring buffers, one per KernelObservation category.
/// Populated by the `observe_kernel!` macro; consumed by policy services.
/// The kernel mmaps this structure read-only into the Tier 2 service address space.
pub struct ObservationRingSet {
/// Per-category ring pointers. Null = unallocated (no consumer for this
/// category). Uses AtomicPtr for lock-free lazy initialization (CAS
/// null -> allocated). Indexed by `SubsystemId as usize`.
pub rings: [AtomicPtr<ObservationRing>; OBSERVATION_CATEGORY_COUNT],
/// CPU this ring set belongs to (for NUMA-aware allocation by the consumer).
pub cpu_id: u32,
/// Total observations dropped due to ring overflow since last reset.
/// Updated atomically by the kernel producer; read-only for consumers.
pub dropped: AtomicU64,
}
/// **Memory footprint**: With `OBSERVATION_CATEGORY_COUNT` (16) categories ×
/// 4096 entries × 64 bytes per `KernelObservation` = **4 MiB per CPU** if all
/// categories are active. On a 128-core system with all active: ~512 MiB.
///
/// **Policy-service-activated allocation**: Rings are NOT allocated at boot.
/// Ring for category C on CPU K is allocated when a Tier 2 policy service
/// registers with `subsystem_mask` bit C set (via `ML_POLICY_REGISTER` ioctl).
/// Allocation happens on the registration code path (cold, may sleep), not on
/// the observe path (warm). This means a system with ML policy disabled uses
/// zero ring memory. A system with only scheduler observations uses 256 KiB/CPU
/// (1 category x 4096 x 64B), not the full 4 MiB.
///
/// **Lifecycle**: allocate → store pointer (Release) → enable static key →
/// observations flow → disable static key → RCU GP → store null → free rings.
/// The static key disable happens-before the AtomicPtr null store.
///
/// **Memory budget**: categories x CPUs x (256 KB raw + summary ring).
/// Registration failure (ENOMEM) rolls back all per-CPU allocations.
/// The per-ring capacity (4096) is configurable at boot via the
/// `umka.ml.ring_capacity` command-line parameter and at runtime via
/// `/sys/kernel/umka/ml/ring_capacity`.
/// Number of observation categories (= max SubsystemId discriminant + 1,
/// rounded up to the next power of two for index-by-enum access).
pub const OBSERVATION_CATEGORY_COUNT: usize = 16;
23.1.4.1.1 Observation Summary Ring (Stress Resilience)¶
Under sustained system stress (IRQ storms, memory pressure, CPU saturation), the raw
observation ring may overflow faster than the Tier 2 policy service can drain it. When
the per-CPU dropped counter exceeds a threshold (default: 64 dropped observations),
the kernel activates a secondary aggregation path.
/// Pre-aggregated observation summary. Produced by a 100ms softirq timer
/// that drains the raw observation ring into fixed-size summaries.
/// The Tier 2 service reads summaries when raw ring overflow is detected,
/// maintaining statistical awareness even when individual observations are lost.
#[repr(C)]
pub struct ObservationSummary {
/// Observation type discriminant — identifies which metric type within
/// this subsystem's summary ring this entry aggregates (e.g.,
/// SchedObs::TaskWoke, MemObs::PageFault). The subsystem identity is
/// implicit from the ring index (ObservationRingSet.rings[subsystem_id]).
/// Without this field, the Tier 2 consumer cannot distinguish summaries
/// from different metric types within the same subsystem ring.
pub obs_type: u16,
/// Explicit padding for u64 alignment of timestamp_ns.
/// (obs_type is u16 to match KernelObservation.obs_type and all *Obs enum
/// discriminants, which are #[repr(u16)].)
pub _pad: [u8; 2],
pub _pad2: [u8; 4],
/// Aggregation window start (monotonic nanoseconds).
pub timestamp_ns: u64,
/// Number of raw observations aggregated into this summary.
/// **Overflow check**: Capped at u32::MAX; if more than ~4 billion
/// observations occur in a single 100ms window (physically impossible),
/// count saturates rather than wrapping.
pub count: u32,
/// Explicit padding for u64 alignment of sum.
pub _pad3: [u8; 4],
/// Sum of the observed metric values (for computing mean).
/// `i64` because `KernelObservation.features` are `[i32; 10]` (signed).
/// Features like `oom_score_adjustment` range [-500, +500] and delta values
/// can be negative. u64 would wrap to a very large value when summing
/// predominantly negative features. i64 safely holds u32::MAX i32 values
/// (max magnitude 9.22e18, within i64::MAX).
pub sum: i64,
/// Minimum observed value in this window.
/// `i32` to match `KernelObservation.features` signedness.
pub min: i32,
/// Maximum observed value in this window.
/// `i32` to match `KernelObservation.features` signedness.
pub max: i32,
/// Approximate 99th percentile (DDSketch or t-digest compatible).
/// `i32` to match `KernelObservation.features` signedness.
pub p99_approx: i32,
/// Trailing padding for struct alignment (8 bytes).
pub _pad4: [u8; 4],
}
const_assert!(size_of::<ObservationSummary>() == 48);
/// Per-CPU summary ring. 64 entries = 6.4 seconds of summaries at 100ms intervals.
/// Sized to bridge a worst-case stress burst without losing statistical coverage.
pub struct ObservationSummaryRing {
entries: [ObservationSummary; 64],
/// Producer index (mod 64). **Longevity**: u32 wraps after ~4.3 billion
/// entries. At 10 entries/sec (100ms intervals), wraps in ~13.6 years.
/// Acceptable: only the low 6 bits are used for indexing (mod 64), and
/// `head - tail` distance is always < 64, so wrap is transparent.
/// u64 is unnecessary because the distance (not the absolute index)
/// determines correctness.
head: AtomicU32,
tail: AtomicU32,
}
Aggregation timer: A per-CPU softirq timer fires every 100ms. On each tick:
1. Drain all entries from the raw observation ring since the last tick.
2. Compute count/sum/min/max/p99 for each metric type.
3. Write one ObservationSummary per metric type to the summary ring.
Tier 2 fallback path: The Tier 2 policy service checks the dropped counter on
each read cycle. If dropped > 0, it switches to reading from the summary ring instead
of the raw ring. This provides degraded-but-usable statistical data (mean, min, max, p99)
even during stress events when per-observation granularity is lost.
Phase assignment: Phase 5c (the ML policy framework is Phase 5c per Section 24.2). The raw observation ring is the primary data source; the summary ring is a hardening improvement for production resilience.
// umka-core/src/ml/params.rs (continued)
/// Read-only shadow view of the KernelTunableParam store for use by policy consumers.
/// Updated atomically on each policy epoch (a complete pass of decay enforcement);
/// consumers read from this without needing to acquire the param store write lock.
///
/// The kernel mmaps this structure into Tier 2 service address space as read-only.
/// Consumers detect staleness by watching `epoch`: if `epoch` changes between two reads
/// of `params[i]`, re-read the slot.
pub struct KernelParamStoreShadow {
/// Epoch counter; incremented on every param store update.
/// Odd epoch = update in progress; even epoch = stable snapshot.
/// Consumers must spin-wait for an even epoch before reading `params`.
pub epoch: AtomicU64,
/// Snapshot of all tunable parameters as of `epoch`.
/// Two-level indexed identically to `KernelParamStore`:
/// group = (param_id >> 8) & 0x0F, index = param_id & 0xFF.
/// Entries whose corresponding param is unregistered hold `i64::MIN` as a sentinel.
pub params: [[AtomicI64; PARAMS_PER_GROUP]; PARAM_GROUP_COUNT],
/// Timestamp of the last shadow update (monotonic clock, nanoseconds since boot).
pub last_update_ns: AtomicU64,
/// Count of seqlock retry exhaustions (reader hit MAX_SEQLOCK_RETRIES=16
/// without obtaining a consistent snapshot). Exposed via
/// `/sys/kernel/umka/ml/shadow_stale_reads`. Non-zero indicates the writer
/// is holding the odd epoch for too long (e.g., large batch of parameter
/// updates). The Tier 2 service receives stale data when this fires —
/// policy parameters are advisory, so stale values cause suboptimal tuning
/// for one cycle, not correctness failure. Incrementing this counter is the
/// diagnostic signal for operators to investigate writer batch sizes.
pub stale_read_count: AtomicU64,
}
/// Total capacity of the shadow store: PARAM_GROUP_COUNT * PARAMS_PER_GROUP = 4096.
/// Both `KernelParamStore` and `KernelParamStoreShadow` use identical two-level
/// indexing so param_id lookup is a single array dereference in both.
pub const MAX_TUNABLE_PARAMS: usize = PARAM_GROUP_COUNT * PARAMS_PER_GROUP;
/// Shadow update protocol:
///
/// 1. The policy consumer thread (kernel-side, one per registered service)
/// processes a batch of PolicyUpdateMsg entries.
/// 2. Before writing to the shadow store:
/// `shadow.epoch.fetch_add(1, Release);` // even → odd (in-progress)
/// 3. Write updated param values to `shadow.params[group][index]` with
/// Relaxed ordering (the epoch fence handles visibility).
/// 4. After all writes:
/// `shadow.last_update_ns.store(now_ns, Relaxed);`
/// `shadow.epoch.fetch_add(1, Release);` // odd → even (stable)
///
/// Consumer (Tier 2 service) read protocol:
///     let mut retry_count = 0;
///     let mut val = shadow.params[group][index].load(Relaxed); // stale fallback
///     loop {
///         let e1 = shadow.epoch.load(Acquire);
///         if e1 & 1 == 0 {                           // even = stable
///             val = shadow.params[group][index].load(Relaxed);
///             let e2 = shadow.epoch.load(Acquire);
///             if e1 == e2 { break val; }             // consistent read
///         } else {
///             core::hint::spin_loop();               // odd = in-progress
///         }
///         // Epoch changed or writer in progress — retry. After
///         // MAX_SEQLOCK_RETRIES (16) iterations (~1µs at ~60ns/iter on
///         // modern CPUs), fall back to the last value read. Policy
///         // parameters are advisory; a stale value causes suboptimal
///         // tuning for one cycle, not a correctness failure. Odd-epoch
///         // spins count toward the retry budget too, so a writer stalled
///         // mid-update cannot spin the reader forever (this exhaustion
///         // is what `stale_read_count` records).
///         retry_count += 1;
///         if retry_count >= MAX_SEQLOCK_RETRIES { break val; }
///     }
///
/// const MAX_SEQLOCK_RETRIES: u32 = 16;
///
/// The decay function (enforce_param_decay) also uses this epoch protocol
/// when resetting expired params in the shadow.
23.1.4.2 Per-Cgroup Parameter Overrides¶
Global parameters (cgroup_id = 0 in PolicyUpdateMsg) are stored in the
KernelParamStore. Per-cgroup overrides allow a Tier 2 policy service to tune
parameters for a specific cgroup (e.g., raise tcp_congestion_min_rtt_ms for a
latency-sensitive container) without affecting other workloads.
Per-cgroup overrides are stored in a MlPolicyCss — a cgroup subsystem state struct
attached to each cgroup via the ml_policy cgroup subsystem:
/// Per-cgroup state for the ml_policy cgroup subsystem.
/// Attached to each cgroup; access is RCU-protected for reads,
/// cgroup's own spinlock for writes.
pub struct MlPolicyCss {
/// Sparse set of parameter overrides for this cgroup.
/// Bounded at CAP_CGROUP_ML_PARAMS entries.
pub overrides: RcuCell<ArrayVec<ParamOverride, CAP_CGROUP_ML_PARAMS>>,
/// Standard cgroup subsystem state (css_id, parent link, refcount).
pub css: CgroupSubsystemState,
}
// Lifecycle: MlPolicyCss is allocated lazily on first ML parameter write to a
// cgroup. Freed in the css_free callback when the cgroup is destroyed — pending
// overrides are drained. Updates targeting a destroyed cgroup's ID are silently
// dropped (cgroup_id lookup returns None).
//
// **Cgroup controller registration**: `ml_policy` is registered as a cgroup v2
// subsystem ([Section 17.2](17-containers.md#control-groups)) at boot (Phase 5c, after cgroup
// framework is available — cgroups ship in Phase 3). Registration provides these callbacks:
// css_alloc: allocate MlPolicyCss (slab: ml_policy_css_cache).
// css_free: drain overrides, free MlPolicyCss.
// css_online: no-op (overrides are lazily populated).
// css_offline: drain overrides (same as css_free path).
// The subsystem exposes these cgroup files:
// `ml_policy.overrides` — read: JSON array of active overrides;
// write: `{"param_id":N, "value":V, "ttl_ms":T}` to set an override.
// `ml_policy.stats` — read-only: per-cgroup ML policy hit/miss counters.
// The controller has no resource charge/uncharge cycle — it is purely
// a parameter-override namespace, not a resource controller.
/// Maximum per-cgroup ML parameter overrides.
/// 8 entries is sufficient for the typical use case (scheduling + memory +
/// network tuning). 8 × 24 bytes = 192 bytes = 3 cache lines.
///
/// **Tradeoff**: With 14+ tunable subsystems, 8 overrides per cgroup is
/// tight for workloads needing per-cgroup tuning across many subsystems
/// simultaneously. However, expanding this increases the per-cgroup
/// memory footprint and worsens linear-scan latency on the lookup hot path
/// (~5 ns for 8 entries). A practical mitigation: subsystems that share
/// the same optimal tuning can be covered by a single override using
/// a composite `param_id` range. If future workloads demonstrate need
/// for >8 overrides, this constant can be increased to 16 (6 cache lines)
/// with proportional memory and latency cost.
pub const CAP_CGROUP_ML_PARAMS: usize = 8;
/// A single parameter override for one cgroup.
/// Tuple: (param_id, value, expiry_ns). The decay function
/// (`enforce_cgroup_override_decay`) removes entries where
/// `expiry_ns <= now_ns`. Set `expiry_ns = u64::MAX` for permanent overrides.
#[repr(C, align(8))]
pub struct ParamOverride {
pub param_id: u32, // matches KernelTunableParam::param_id
pub _pad: u32,
pub value: i64, // overrides KernelTunableParam::current for this cgroup
pub expiry_ns: u64, // monotonic clock expiration; u64::MAX = permanent
}
// Size is load-bearing: ArrayVec<ParamOverride, 8> = 192 bytes = 3 cache lines.
const_assert!(core::mem::size_of::<ParamOverride>() == 24);
Lookup semantics (used at subsystem integration points, §23.1.5):
- Read the current task's cgroup: `CpuLocal::current_task().cgroup`.
- Load the cgroup's `MlPolicyCss::overrides` under an RCU read guard.
- Linear-scan the `ArrayVec` for a matching `param_id` (≤8 entries; fits in 3 cache lines; faster than a hash lookup at this size).
- If found, return the override value; otherwise fall through to the global `KernelParamStore::params[param_id].current.load(Acquire)`.
Two-level lookup adds ~5 ns in the common case (no override: RCU dereference + small empty-array scan + global load).
Write path (PolicyUpdateMsg with non-zero cgroup_id):
- Verify caller holds `Capability::KernelMlTune` scoped to the target cgroup's user namespace.
- Locate the target cgroup by `cgroup_id` via RCU-protected idr lookup.
- Acquire the cgroup's spinlock. Clone the current `overrides` ArrayVec.
- Insert or update the `ParamOverride` entry in the clone.
- Publish via `rcu_assign()` on `overrides`. Release the spinlock.
- The old `ArrayVec` is freed after the current RCU grace period (Section 3.1).
The write path resolves the target cgroup by cgroup_id and re-validates capability
scope against the resolved cgroup under the cgroup spinlock. Cgroup IDs are monotonic
u64 counters (never recycled) — a stale cgroup_id produces ENOENT, not a
use-after-free.
A Tier 2 policy service communicates with the kernel entirely through shared-memory ring buffers and ioctls — no function-pointer callbacks. This is a hard requirement because Tier 2 services run in Ring 3 (userspace processes) and the kernel cannot call function pointers into a Ring 3 address space.
/// Registration request passed via ioctl. No function pointers —
/// all communication uses shared-memory rings.
#[repr(C)]
pub struct PolicyRegisterRequest {
/// Struct size for forward/backward compatibility.
pub struct_size: u64,
/// Protocol version (currently 1).
pub version: u32,
/// Bitmask of subsystem observation streams to subscribe to.
/// Bit positions correspond to `SubsystemId` enum values.
/// 32-bit mask supports up to 32 subsystems. Current capacity:
/// 14 used, 18 available.
pub subsystem_mask: u32,
/// Requested observation ring capacity (number of `KernelObservation`
/// entries per per-CPU ring). Must be a power of two in [256, 65536].
/// The kernel may clamp to the system-wide maximum
/// (`umka.ml.ring_capacity`). Zero means "use system default".
/// Controls observation ring capacity only. The update ring capacity is
/// configured separately via `umka.ml_policy_ring_capacity` boot
/// parameter (default: 256 entries).
pub ring_capacity: u32,
/// Reserved for future extension. Must be zeroed by the caller;
/// the kernel rejects requests where any reserved byte is non-zero.
pub _reserved: [u8; 44],
}
const_assert!(size_of::<PolicyRegisterRequest>() == 64);
/// Registration response returned by ioctl.
#[repr(C)]
pub struct PolicyRegisterResponse {
/// File descriptor for the observation ring (mmap to access).
/// The ring contains `KernelObservation` entries produced by the kernel.
/// The service polls or uses epoll on this fd for readability.
pub obs_ring_fd: i32,
/// File descriptor for the parameter update ring (mmap to access).
/// The service writes `PolicyUpdateMsg` entries; kernel consumes them.
pub update_ring_fd: i32,
/// File descriptor for the read-only parameter store snapshot (mmap to access).
/// Maps a `KernelParamStoreShadow` page. Service reads current parameter values.
pub param_store_fd: i32,
pub _pad: i32,
/// Per-registration HMAC session key (32 bytes, HMAC-SHA3-256).
/// Generated by the kernel from `get_random_bytes(32)` during ioctl
/// processing. Unique per registration; destroyed when the service
/// deregisters (fd close). The service includes this key in every
/// `PolicyUpdateMsg` HMAC computation (Section 23.1.4.1).
///
/// **Security**: The key is returned only once, in the ioctl response
/// buffer. It is never stored in procfs, sysfs, or any other
/// user-visible location. The kernel stores a copy in the
/// `PolicyServiceRegistration` struct for HMAC verification.
///
/// **Threat model note (ptrace)**: A ptrace attacker with `PTRACE_ATTACH`
/// capability already has full control over the target process (can read
/// memory, inject syscalls, modify registers). Exposure of the session
/// key via ptrace adds no attack surface beyond existing ptrace
/// capabilities. Processes requiring stronger isolation should use
/// `PR_SET_DUMPABLE(0)` or run in a non-dumpable cgroup.
pub session_key: [u8; 32],
}
// PolicyRegisterResponse: 4 + 4 + 4 + 4 + 32 = 48 bytes.
// Returned via ioctl to userspace — ABI struct requires size verification.
const_assert!(size_of::<PolicyRegisterResponse>() == 48);
Registration protocol: A Tier 2 policy service registers with the kernel via
/dev/umka_ml_policy (a miscdevice created by umka-core at boot):
- **Open**: The service opens `/dev/umka_ml_policy`. The kernel allocates a `PolicyServiceRegistration` struct.
- **Register**: `ioctl(fd, ML_POLICY_REGISTER, &PolicyRegisterRequest)`. The kernel validates:
  - Caller holds `Capability::KernelMlTune` (per-user-namespace).
  - `struct_size` and `version` are compatible.
  - `subsystem_mask` contains only valid subsystem bits.

  On success, returns a `PolicyRegisterResponse` with three file descriptors: the observation ring fd (readable, pollable), the update ring fd (writable), and the parameter store fd (read-only mmap).
- **Run**: The service polls `obs_ring_fd` for new observations, reads `KernelObservation` entries from the mmap'd ring, runs its ML model, and writes `PolicyUpdateMsg` entries to the update ring. The kernel's policy consumer thread reads update messages and applies them to the parameter store.
- **Close / Deregistration**: Closing the fd deregisters the service. The kernel performs the following steps in order:
  1. **Disable observation delivery**: Toggle the per-registration static key to NOP, preventing new `observe_kernel!` calls from writing to this service's per-CPU observation rings. After this step, no new observations will be enqueued.
  2. **RCU grace period**: Wait for a full RCU grace period to ensure that all in-flight `observe_kernel!` calls (which may have entered the ring write path before the static key toggle) have completed.
  3. **Drain update ring**: Read and apply all remaining `PolicyUpdateMsg` entries from the update ring. Each entry is validated via HMAC; invalid entries are silently discarded.
  4. **Remove registration**: Remove the `PolicyServiceRegistration` from the global registry. After this step, no kernel consumer thread will attempt to read from this service's rings.
  5. **Unmap rings**: Unmap the observation ring, update ring, and parameter store mmap regions from the service's address space.
  6. **Free resources**: Free the per-CPU observation rings, the update ring, the session key, and the per-registration rate limiter.
  7. **Parameter decay**: Parameters managed by this service are NOT immediately reverted. They retain their current values until `valid_for_ms` expires, at which point they decay to defaults.

Crash recovery: If the Tier 2 service is killed (SIGKILL, OOM), the fd is closed by the kernel's process exit cleanup path (`exit_files()`), triggering the same deregistration sequence (steps 1-7 above). Partially-written ring entries may contain corrupt data — step 3 validates each entry's HMAC and discards entries with invalid or incomplete HMAC fields.
ML_POLICY_REGISTER ioctl number: _IOW('M', 0x01, PolicyRegisterRequest).
Error codes returned by ML_POLICY_REGISTER:
| Errno | Condition |
|---|---|
| `EPERM` | Caller does not hold `Capability::KernelMlTune` in the current user namespace |
| `EINVAL` | `struct_size` does not match any known `PolicyRegisterRequest` layout version |
| `EINVAL` | `version` is not compatible with the kernel's supported range |
| `EINVAL` | `subsystem_mask` contains bits for undefined subsystems |
| `EBUSY` | Maximum number of concurrent policy services reached (`MAX_POLICY_SERVICES`, default 4) |
| `ENOMEM` | Failed to allocate observation ring, update ring, or parameter store mapping |
| `EACCES` | Policy service binary does not have a valid ML-DSA-65 signature (when `CONFIG_ML_POLICY_SIGNED` is enabled) |
/// Message type discriminants for the policy update ring.
/// The first byte of every message on the update ring is a `msg_type` field,
/// allowing the kernel to demultiplex `PolicyUpdateMsg` from control messages
/// (e.g., `ObservationThrottleMsg`) without ambiguity.
pub const MSG_TYPE_POLICY_UPDATE: u8 = 2;
pub const MSG_TYPE_THROTTLE: u8 = 1;
/// Message sent from Tier 2 policy service to the kernel to update a parameter.
/// Submitted via a dedicated policy update ring buffer (separate from observations).
///
/// Layout is fixed at 128 bytes for KABI stability. All implicit compiler padding
/// is made explicit via named `_pad*` fields so that the layout is identical
/// across compiler versions, optimization levels, and target architectures.
/// Future fields may be added within `_reserved`; increment `PolicyRegisterRequest::version`
/// when doing so.
///
/// **Version scope**: Individual `PolicyUpdateMsg` entries do not carry a
/// version field. The version is established per-registration (per-session)
/// via `PolicyRegisterRequest::version`. The service and kernel agree on a
/// single wire format version for the entire session. This avoids 4 bytes
/// of per-message overhead on a 128-byte message.
///
/// The `msg_type` field (first byte) is `MSG_TYPE_POLICY_UPDATE = 2`, which
/// the kernel uses to distinguish this message from control messages such as
/// `ObservationThrottleMsg` (`MSG_TYPE_THROTTLE = 1`). Without this field,
/// the low byte of `param_id` on little-endian would collide with throttle
/// messages when `param_id == 1`.
#[repr(C)]
pub struct PolicyUpdateMsg {
/// Message type discriminant: `MSG_TYPE_POLICY_UPDATE = 2`.
/// Must be the first byte so the kernel can demultiplex all
/// message types on the update ring by inspecting `ring[offset + 0]`.
pub msg_type: u8,
/// Explicit padding after msg_type to align param_id to 4 bytes.
pub _pad_mt: [u8; 3],
/// Which parameter to update.
pub param_id: u32,
/// New value — kernel clamps to [min_value, max_value] before applying.
pub new_value: i64,
/// Parameter is valid for this many ms (0 = use param's default decay_period).
/// If the service crashes, the parameter decays after this interval.
pub valid_for_ms: u32,
/// Explicit padding to align `model_seq` to 8 bytes.
pub _pad1: [u8; 4],
/// Monotonic sequence number from the ML model that produced this update.
/// Out-of-order or duplicate updates (lower seq than current) are silently dropped.
pub model_seq: u64,
/// Optional: restrict update to a specific cgroup (0 = global / kernel-wide).
pub cgroup_id: u64,
/// Reserved for future fields (increment `version` when adding).
/// Placed before HMAC so future fields are automatically authenticated.
pub _reserved: [u8; 56],
/// HMAC-SHA3-256 over all preceding bytes (msg_type through _reserved).
/// HMAC-last ensures any future field addition in _reserved is
/// automatically covered by the authentication. Computed with the
/// per-registration session key (derived during ioctl handshake).
pub hmac: [u8; 32],
}
const _: () = assert!(core::mem::size_of::<PolicyUpdateMsg>() == 128);
/// Control message sent by a Tier 2 policy service to throttle observation
/// delivery. Written to the update ring with `msg_type = MSG_TYPE_THROTTLE`
/// (discriminated by the first byte, same as `PolicyUpdateMsg::msg_type`).
///
/// Use case: when the service's model cannot keep up with observation rate
/// (detected via `ObservationRingHeader::dropped` counter), it sends a
/// throttle message to reduce kernel-side emission.
#[repr(C)]
pub struct ObservationThrottleMsg {
/// Message type discriminant: `MSG_TYPE_THROTTLE` (= 1).
pub msg_type: u8,
pub _pad0: [u8; 3],
/// Target subsystem to throttle (SubsystemId discriminant value).
/// 0xFF = all subsystems for this registration.
///
/// **Scope**: Throttle affects only this service's observation ring
/// delivery, not kernel-side emission or other registered services.
/// The kernel continues to emit observations internally (for other
/// consumers or for internal health monitoring) — only the delivery
/// to this service's per-CPU observation ring is sampled down.
pub subsystem: u8,
pub _pad1: [u8; 3],
/// Throttle factor: emit 1-in-N observations. 1 = no throttle (full rate).
/// 2 = emit every other observation. 0 = pause entirely (no observations).
/// The kernel applies this as a per-subsystem sampling divisor.
pub throttle_factor: u32,
/// Duration in milliseconds. After this period, the kernel reverts to
/// full-rate emission (throttle_factor = 1). 0 = indefinite (until next
/// throttle message or service deregistration).
pub duration_ms: u32,
/// Reserved for future fields. Placed before HMAC so that future fields
/// are automatically covered by HMAC verification (same pattern as
/// PolicyUpdateMsg).
pub _reserved: [u8; 80],
/// HMAC-SHA3-256 over all preceding bytes (msg_type through _reserved)
/// using the per-registration session key.
/// Replay protection unnecessary: throttle messages are written to the
/// service's own ring. Replay is a self-attack.
pub hmac: [u8; 32],
}
const _: () = assert!(core::mem::size_of::<ObservationThrottleMsg>() == 128);
The kernel validates the HMAC on every message consumption. Invalid HMACs increment a
per-service hmac_reject_count counter and raise an FMA alert.
Policy update ring buffer — shared memory between Tier 2 service and kernel:
/// Ring buffer for PolicyUpdateMsg entries. Allocated by the kernel,
/// mmap'd into the Tier 2 service's address space (writable for the service,
/// readable for the kernel).
///
/// Layout: [PolicyUpdateRingHeader][PolicyUpdateMsg; capacity]
/// The service writes messages at `write_idx`, the kernel reads from `read_idx`.
#[repr(C)]
pub struct PolicyUpdateRingHeader {
/// Ring capacity in entries (power of two). Set by kernel at allocation.
pub capacity: u32,
pub _pad0: u32,
/// Write index (next slot to write). Updated by the Tier 2 service
/// with Release ordering after writing the message.
pub write_idx: AtomicU64,
/// Read index (next slot to read). Updated by the kernel's policy
/// consumer thread with Release ordering after processing.
pub read_idx: AtomicU64,
/// Overflow counter (incremented by service if ring is full).
pub overflow: AtomicU64,
}
const _: () = assert!(core::mem::size_of::<PolicyUpdateRingHeader>() == 32);
/// Default capacity: 256 entries (256 × 128 B = 32 KiB of messages plus the
/// 32-byte header; nine 4 KiB pages total). The kernel allocates this from its ring buffer
/// slab and maps it into the service's address space via the `update_ring_fd`
/// returned by ML_POLICY_REGISTER.
///
/// **Override**: The service may request a different capacity via the
/// `PolicyRegisterRequest::ring_capacity` field (power-of-two, clamped to
/// [64, 4096]). If 0 or omitted, the default is used. The kernel parameter
/// `umka.ml_policy_ring_capacity` (boot parameter) sets the system-wide
/// default, overriding this constant. Larger rings reduce backpressure
/// for bursty services at the cost of per-registration memory.
pub const POLICY_UPDATE_RING_CAPACITY: u32 = 256;
The service submits updates by writing a PolicyUpdateMsg at
buffer[write_idx & (capacity - 1)], then incrementing write_idx with
Release ordering. The kernel's policy consumer thread (one per registered
service) polls write_idx > read_idx and processes entries sequentially.
If write_idx - read_idx >= capacity, the ring is full and the service
must wait (spin or poll on read_idx advancing).
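The protocol above can be sketched as a userspace single-producer/single-consumer model (an illustrative sketch only; `PolicyRing` and the `u32` payload stand in for the kernel's shared-memory mapping and `PolicyUpdateMsg`):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::{Acquire, Relaxed, Release}};

/// Minimal SPSC model of the policy update ring: the service writes at
/// write_idx & (capacity - 1) and publishes with a Release store; the
/// kernel consumer polls write_idx > read_idx with an Acquire load.
struct PolicyRing {
    capacity: u64,        // power of two, as in PolicyUpdateRingHeader
    write_idx: AtomicU64, // next slot to write (service side)
    read_idx: AtomicU64,  // next slot to read (kernel side)
    slots: Vec<AtomicU32>, // stand-in for the PolicyUpdateMsg array
}

impl PolicyRing {
    fn new(capacity: u64) -> Self {
        assert!(capacity.is_power_of_two());
        PolicyRing {
            capacity,
            write_idx: AtomicU64::new(0),
            read_idx: AtomicU64::new(0),
            slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Service side: returns false (backpressure) when
    /// write_idx - read_idx >= capacity, i.e. the ring is full.
    fn push(&self, msg: u32) -> bool {
        let w = self.write_idx.load(Relaxed);
        if w - self.read_idx.load(Acquire) >= self.capacity {
            return false; // full: caller must wait for read_idx to advance
        }
        self.slots[(w & (self.capacity - 1)) as usize].store(msg, Relaxed);
        self.write_idx.store(w + 1, Release); // publish the message
        true
    }

    /// Kernel side: consumes one entry if write_idx > read_idx.
    fn pop(&self) -> Option<u32> {
        let r = self.read_idx.load(Relaxed);
        if self.write_idx.load(Acquire) <= r {
            return None; // empty
        }
        let msg = self.slots[(r & (self.capacity - 1)) as usize].load(Relaxed);
        self.read_idx.store(r + 1, Release); // free the slot for the producer
        Some(msg)
    }
}
```

The Release/Acquire pairing mirrors the header's documented orderings: a consumer that observes the incremented `write_idx` is guaranteed to see the message bytes written before it.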
The kernel validates all PolicyUpdateMsg entries before applying:
1. param_id must exist in KernelParamStore
2. new_value is clamped to [min_value, max_value] (never rejected; clamped silently)
3. Caller must hold Capability::KernelMlTune (Tier 2 services receive this at registration if granted by the operator; see Section 23.1.8)
4. model_seq must be ≥ the last applied seq for this (param_id, cgroup_id) pair
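Steps 2 and 4 reduce to a clamp plus a monotonic sequence check; a minimal sketch (`TunableSlot` is a hypothetical illustration, not the kernel's actual `KernelParamStore` entry type):

```rust
/// Illustrative model of validation steps 2 and 4: out-of-range values are
/// silently clamped (never rejected); stale model_seq values ARE rejected
/// so out-of-order updates cannot overwrite newer ones.
struct TunableSlot {
    min_value: i64,
    max_value: i64,
    value: i64,
    last_seq: u64, // last applied model_seq for this (param_id, cgroup_id)
}

impl TunableSlot {
    /// Returns true if the update was applied (possibly clamped),
    /// false if it was dropped as stale.
    fn apply(&mut self, new_value: i64, model_seq: u64) -> bool {
        if model_seq < self.last_seq {
            return false; // stale: keep current value
        }
        self.value = new_value.clamp(self.min_value, self.max_value);
        self.last_seq = model_seq;
        true
    }
}
```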
Parameter ID enumeration — Every tunable parameter has an explicit param_id value.
IDs are partitioned by subsystem prefix to allow independent subsystem evolution.
The KernelTunableParam::param_id field stores these values.
ParamId ranges are allocated per subsystem in 256-entry blocks, mapped to
SubsystemId via ((param_id >> 8) & 0x0F) + 1:
| Range | Group | SubsystemId | Name |
|---|---|---|---|
| 0x0000-0x00FF | 0 | 1 | Scheduler |
| 0x0100-0x01FF | 1 | 2 | MemoryManager |
| 0x0200-0x02FF | 2 | 3 | TcpStack |
| 0x0300-0x03FF | 3 | 4 | BlockIo |
| 0x0400-0x04FF | 4 | 5 | PowerManager |
| 0x0500-0x05FF | 5 | 6 | FmaHealth |
| 0x0600-0x06FF | 6 | 7 | NvmeDriver |
| 0x0700-0x07FF | 7 | 8 | NetworkDriver |
| 0x0800-0x08FF | 8 | 9 | IoScheduler |
| 0x0900-0x09FF | 9 | 10 | Gpu |
| 0x0A00-0x0AFF | 10 | 11 | Storage |
| 0x0B00-0x0BFF | 11 | 12 | Accel |
| 0x0C00-0x0CFF | 12 | 13 | VfsLayer |
| 0x0D00-0x0DFF | 13 | 14 | ContainerMgr |
IDs within a range are assigned contiguously from the base.
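The mapping is a pure function of the ID's upper byte; a one-line sketch matching the table above:

```rust
/// Map a param_id to its SubsystemId per the range table:
/// group = (param_id >> 8) & 0x0F, SubsystemId = group + 1.
fn subsystem_id(param_id: u32) -> u8 {
    (((param_id >> 8) & 0x0F) + 1) as u8
}
```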
/// Well-known parameter IDs. Subsystem ranges are 256 entries wide;
/// new parameters are appended within their range (never reuse deleted IDs).
#[repr(u32)]
pub enum ParamId {
// — Scheduler (0x0000–0x00FF) —
SchedEevdfWeightScale = 0x0000,
SchedMigrationBenefitThreshold = 0x0001,
SchedPreemptionLatencyBudget = 0x0002,
SchedEasEnergyBias = 0x0003,
SchedCfsBurstQuotaUs = 0x0004,
/// Intent scheduler PID controller: proportional gain for latency.
/// This is an ML policy tuning knob that controls HOW the intent
/// scheduler adjusts parameters — not part of the ResourceIntent struct
/// itself (defined in Section 7.7.1). ResourceIntent declares WHAT the
/// workload needs (target_latency_ns, target_ops_per_sec, etc.); these
/// ParamId entries control the PID controller that tunes kernel internals
/// to satisfy those intent targets.
SchedIntentKpLatency = 0x0005,
/// Intent scheduler PID controller: derivative gain for latency.
SchedIntentKdLatency = 0x0006,
/// Maximum single-step adjustment magnitude for intent tuning.
SchedIntentMaxAdjustment = 0x0007,
/// Hold time (ms) after an intent adjustment before re-evaluating.
SchedIntentHoldTimeMs = 0x0008,
/// Resched urgency default. Controls whether `resched_curr()` uses `Eager`
/// (immediate IPI) or `Lazy` (deferred to next tick). UmkaOS-original enum
/// `ReschedUrgency` — see {ref:eevdf-scheduler#reschedurgency-enum} <!-- UNRESOLVED -->.
/// Range: 0 = Lazy (default), 1 = Eager. ML can shift the default policy
/// per cgroup based on latency sensitivity.
SchedReschedUrgency = 0x0009,
// — Memory Manager (0x0100–0x01FF) —
MemReclaimAggressiveness = 0x0100,
MemPrefetchWindowPages = 0x0101,
MemNumaMigrationThreshold = 0x0102,
MemCompressEntropyThreshold = 0x0103,
MemSwapLocalRatio = 0x0104,
/// OOM victim score adjustment from ML model. Range: [-500, +500].
/// Applied additively to the base OOM score. Bounded to prevent
/// overriding administrator-set oom_score_adj by more than 25% of total range.
/// See [Section 4.2](04-memory.md#physical-memory-allocator--ml-policy-integration-intelligent-oom-victim-selection).
MemOomScoreAdjustment = 0x0105,
/// OOM proactive eviction threshold. When ML predicts OOM within N seconds,
/// trigger proactive soft reclaim (memory.reclaim) on the predicted victim's
/// cgroup before hard OOM occurs. Range: 0 (disabled) to 60 (seconds).
MemOomProactiveEvictionSec = 0x0106,
// — TCP / Network (0x0200–0x02FF) —
NetTcpInitialCwndScale = 0x0200,
NetBbrProbeRttIntervalMs = 0x0201,
NetTcpPacingGainPct = 0x0202,
NetEcnAggressiveness = 0x0203,
// — BlockIo (0x0300–0x03FF) —
BlockIoBatchSizeThreshold = 0x0300,
// — Power Manager (0x0400–0x04FF) —
PowerRaplPackagePowerW = 0x0400,
PowerCpuFreqMinMhz = 0x0401,
PowerAccelPowerCapW = 0x0402,
PowerThermalTargetC = 0x0403,
// — FMA / Observability (0x0500–0x05FF) —
FmaAnomalyAlertThreshold = 0x0500,
FmaHealthCheckIntervalMs = 0x0501,
FmaErrorRateWindowMs = 0x0502,
// — NvmeDriver (0x0600–0x06FF) —
NvmeQueueDepthTuning = 0x0600,
// — NetworkDriver (0x0700–0x07FF) —
NetDriverCoalesceUs = 0x0700,
// — I/O Scheduler (0x0800–0x08FF) —
IoReadaheadPages = 0x0800,
IoQueueDepthTarget = 0x0801,
IoLatencyTargetUs = 0x0802,
// — Gpu (0x0900–0x09FF) —
GpuFreqTargetMhz = 0x0900,
// — Storage (0x0A00–0x0AFF) —
StorageWritebackThresholdPct = 0x0A00,
// — Accel (0x0B00–0x0BFF) —
AccelBatchSizeHint = 0x0B00,
// — VfsLayer (0x0C00–0x0CFF) —
VfsDentryCacheTargetPct = 0x0C00,
// — ContainerMgr (0x0D00–0x0DFF) —
CgroupMemoryReclaimPressurePct = 0x0D00,
}
/// Per-subsystem observation type enums. The `obs_type` field in
/// `KernelObservation` is cast from these enums (u16 discriminant).
/// Scheduler observation types.
#[repr(u16)]
pub enum SchedObs {
/// Task wakeup latency.
TaskWoke = 0,
/// NUMA/load migration decision.
MigrateDecision = 1,
/// Preemption occurred.
PreemptionEvent = 2,
/// Per-CPU runqueue snapshot (every 10ms).
RunqueueStats = 3,
/// EAS placement decision.
EasDecision = 4,
/// Intent-based scheduler feedback: current intent status for a cgroup.
/// Emitted by `intent_observe_status()` on each intent evaluation cycle.
IntentStatus = 5,
/// Latency histogram snapshot for intent-based scheduling evaluation.
/// Emitted periodically (every 60s) by the intent scheduler subsystem.
LatencyHistogram = 6,
}
/// Memory manager observation types.
#[repr(u16)]
pub enum MemObs {
/// Page fault event.
PageFault = 0,
/// Which page was evicted from LRU.
EvictionDecision = 1,
/// NUMA migration result.
NumaMigration = 2,
/// Previously-evicted page faulted again.
RefaultRecord = 3,
/// Memory pressure snapshot (every 1s).
MemPressure = 4,
/// OOM killer invoked — candidate features for victim selection.
/// Features: constraint_type, candidate_count, victim_pid, victim_rss,
/// victim_swap, victim_score, victim_cgroup_id, free_pages, psi_stall_us.
/// See [Section 4.2](04-memory.md#physical-memory-allocator--ml-policy-integration-intelligent-oom-victim-selection).
OomVictimSelection = 5,
/// OOM outcome — emitted 5s after kill with measured recovery metrics.
/// Features: victim_pid, pages_freed, time_to_recovery_ms, service_restart_ms,
/// cascading_kills, cgroup_oom_count_after, was_ml_adjusted, baseline_score.
/// This is the ground-truth feedback signal for the OOM ML model.
OomOutcome = 6,
}
/// TCP / Network observation types.
#[repr(u16)]
pub enum NetObs {
/// Congestion window event.
CongestionEvent = 0,
/// Per-flow statistics snapshot.
FlowStats = 1,
/// Routing decision.
RouteDecision = 2,
}
/// I/O scheduler observation types.
#[repr(u16)]
pub enum IoObs {
/// I/O request completion latency.
IoCompletion = 0,
/// Queue depth snapshot.
QueueDepthStats = 1,
}
/// FMA / Observability observation types.
#[repr(u16)]
pub enum FmaObs {
/// Device health metric update.
FmaHealth = 0,
/// Anomaly score exceeded threshold.
AnomalyAlert = 1,
}
/// Power manager observation types.
#[repr(u16)]
pub enum PowerObs {
/// RAPL energy sample.
EnergySample = 0,
/// Thermal throttle event.
ThermalThrottle = 1,
/// CPU frequency transition.
FreqTransition = 2,
}
23.1.5 Subsystem Integration Catalog¶
Each subsystem that participates in ML tuning registers its parameters at boot. The tables below define the initial parameter sets and observation types.
23.1.5.1.1 Scheduler (Section 7.1)¶
Observation types (SchedObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| TaskWoke | latency_ns, runq_len, cpu, prev_cpu, cgroup_id | Task wakeup latency |
| MigrateDecision | src_cpu, dst_cpu, task_weight, queue_diff, benefit_ns | NUMA/load migration |
| PreemptionEvent | preemptor_prio, preemptee_prio, cgroup_id, — | Preemption occurred |
| RunqueueStats | runq_len, avg_vruntime, nr_throttled, nr_rt, cgroup_id | Per-CPU runqueue snapshot (every 10ms) |
| EasDecision | task_cgroup, chosen_cpu, energy_delta_uw, load_delta, — | EAS placement |
Tunable parameters (SchedParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| eevdf_weight_scale | 100 | 50 | 200 | 60s | Scale factor for virtual deadline computation (100 = baseline) |
| migration_benefit_threshold | 1000 | 100 | 50000 | 30s | Minimum ns benefit to justify task migration |
| preemption_latency_budget | 1000 | 100 | 10000 | 30s | Maximum μs a lower-priority task runs before preemption check |
| eas_energy_bias | 50 | 0 | 100 | 60s | 0 = performance, 100 = max energy saving in EAS decisions |
| cfs_burst_quota_us | 0 | 0 | 100000 | 60s | CFS burst tolerance for cgroup (0 = disabled) |
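The decay column above implements the temporal-decay rule from Section 23.1.1: an ML-set value expires after its decay period and the parameter reverts to the heuristic default. A minimal read-side sketch (`DecayingParam` is a hypothetical illustration; the kernel's actual representation may differ):

```rust
/// A tunable with temporal decay: reads after the expiry deadline see the
/// heuristic default, so a crashed Tier 2 service cannot pin a stale value.
struct DecayingParam {
    default_value: i64,
    ml_value: i64,
    /// Set to now + decay_period_ms * 1_000_000 when an update is applied.
    expires_at_ns: u64,
}

impl DecayingParam {
    fn effective(&self, now_ns: u64) -> i64 {
        if now_ns >= self.expires_at_ns {
            self.default_value // decayed: heuristic fallback
        } else {
            self.ml_value
        }
    }
}
```

Checking expiry on read rather than with a timer keeps the hot path branch-cheap and requires no per-parameter timer state.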
23.1.5.1.2 Memory Manager (Section 4)¶
Observation types (MemObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| PageFault | cgroup, fault_type, addr_band, file_offset_band, prefetch_hit | Page fault event |
| EvictionDecision | cgroup, evicted_page_type, lru_age, refault_distance, — | Which page was evicted |
| NumaMigration | src_node, dst_node, cgroup, pages_moved, benefit_ns | NUMA migration result |
| RefaultRecord | cgroup, file_inode, page_offset, time_since_evict_ms | Previously-evicted page faulted again |
| MemPressure | node, free_pages, anon_pages, file_pages, slab_pages | Memory pressure snapshot (every 1s) |
| OomVictimSelection | constraint_type, candidate_count, victim_pid, victim_rss, victim_swap, victim_score, victim_cgroup_id, free_pages, psi_stall_us | OOM killer invoked — candidate features for ML-adjusted victim selection |
| OomOutcome | victim_pid, pages_freed, time_to_recovery_ms, service_restart_ms, cascading_kills, oom_count_after, was_ml_adjusted, baseline_score | Ground-truth feedback: measured outcome 5s after OOM kill |
Tunable parameters (MemParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| reclaim_aggressiveness | 100 | 25 | 400 | 30s | LRU reclaim rate relative to baseline |
| prefetch_window_pages | 8 | 1 | 128 | 30s | Max pages to prefetch per fault event |
| numa_migration_threshold | 200 | 50 | 2000 | 60s | Minimum benefit (ns) per page to trigger NUMA migration |
| compress_entropy_threshold | 128 | 64 | 255 | 60s | Page entropy (0–255) above which compression is skipped |
| swap_local_ratio | 80 | 0 | 100 | 30s | % of swap that goes to local vs RDMA remote swap (Section 5) |
| oom_score_adjustment | 0 | -500 | 500 | 30s | ML-derived OOM score adjustment per candidate. Bounded: cannot override admin-set oom_score_adj by more than 25% of range. See Section 4.2. |
| oom_proactive_eviction_sec | 0 | 0 | 60 | 30s | Proactive soft reclaim window (seconds). When ML predicts OOM within this horizon, trigger memory.reclaim on the predicted victim's cgroup before hard OOM occurs. 0 = disabled (pure reactive). |
23.1.5.1.3 TCP / Network (Section 16.8)¶
Observation types (NetObs):
| obs_type | features[0..9] | Meaning |
|---|---|---|
| CongestionEvent | cgroup, cwnd, rtt_us, retransmits, bandwidth_mbps | Congestion window event |
| FlowStats | cgroup, bytes_sent, bytes_recv, rtt_p99_us, loss_pct | Per-flow statistics snapshot |
| RouteDecision | src_addr_band, dst_addr_band, chosen_dev, alternative_dev, latency_us | Routing decision |
Tunable parameters (NetParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| tcp_initial_cwnd_scale | 10 | 2 | 100 | 30s | Initial congestion window (segments) per cgroup |
| bbr_probe_rtt_interval_ms | 10000 | 200 | 60000 | 60s | BBR min-RTT probe interval |
| tcp_pacing_gain_pct | 125 | 100 | 200 | 30s | BBR pacing gain percentage |
| ecn_aggressiveness | 1 | 0 | 3 | 60s | 0=off, 1=ECT(1), 2=ECT(0), 3=always mark |
23.1.5.1.4 I/O Scheduler (Section 15.18)¶
Tunable parameters (IoParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| readahead_pages | 32 | 1 | 512 | 30s | Readahead window in pages |
| queue_depth_target | 32 | 4 | 1024 | 30s | Target NVMe queue depth per cgroup |
| latency_target_us | 0 | 0 | 100000 | 30s | 0 = throughput mode; >0 = latency target (μs) |
23.1.5.1.5 Power Manager (Section 7.7)¶
Tunable parameters (PowerParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| rapl_package_power_w | 0 | 5 | 400 | 10s | CPU package power cap (W); 0 = no cap |
| cpu_freq_min_mhz | 0 | 0 | 10000 | 10s | Minimum CPU frequency; 0 = hardware default |
| accel_power_cap_w | 0 | 0 | 1000 | 10s | Per-accelerator power cap; 0 = hardware default |
| thermal_target_c | 95 | 60 | 105 | 5s | Thermal throttle target (°C) |
23.1.5.1.6 FMA / Observability (Section 20.1)¶
Tunable parameters (FmaParam):
| param_name | default | min | max | decay | Effect |
|---|---|---|---|---|---|
| anomaly_alert_threshold | 80 | 10 | 100 | 60s | Score (0–100) above which FMA raises alert |
| health_check_interval_ms | 1000 | 100 | 60000 | 0 | How frequently to poll device health counters |
| error_rate_window_ms | 5000 | 100 | 300000 | 0 | Error rate measurement window |
23.1.6 Heavy Model Integration Pattern¶
The "big model" pattern enables a Tier 2 driver to call any model — a large transformer,
an LLM, a remote inference service, or a custom RL policy — and feed the results back as
PolicyUpdateMsg entries. The kernel is unaware of where the result came from; it only
sees a bounded parameter update through the standard mechanism.
Data flow (the PolicyUpdateRingHeader and PolicyUpdateMsg structures, and the
per-CPU ObservationRing, are defined earlier in this section):
╔═══════════════════════════════════════════════════════════════════╗
║ UmkaOS Core (Tier 0, Ring 0) ║
║ ║
║ [observe_kernel! macro calls] ──► [ObservationRing per CPU] ║
║ ║
║ [KernelParamStore] ◄── [PolicyUpdateMsg ring] ◄── [validation] ║
╚══════════════╬══════════════════════════════╬═════════════════════╝
║ mmap ObservationRings ║ PolicyUpdateMsg
▼ (read-only, zero-copy) ║ (write ring, validated)
╔══════════════════════════════════╗ ║
║ Tier 2 Policy Service Process ║ ║
║ (Ring 3, hardware-isolated) ║ ║
║ ║ ║
║ ┌──────────────────────────┐ ║ ║
║ │ Observation aggregator │ ║ ║
║ │ (100ms/1s/10s windows) │ ║ ║
║ └──────────┬───────────────┘ ║ ║
║ │ feature vector ║ ║
║ ▼ ║ ║
║ ┌──────────────────────────┐ ║ ║
║ │ Inference layer │ ║ ║
║ │ (in-process small model │ ║ ║
║ │ OR call big model ──►──╫────╫─►external)║
║ │ ◄── result ────────────╫────╫─◄──────── ║
║ └──────────┬───────────────┘ ║ ║
║ │ PolicyUpdateMsg ║ ║
║ └────────────────────╫────────────╝
║ ║
╚══════════════════════════════════╝
External big model call — concrete example:
A Tier 2 scheduler policy service runs a 5-minute characterization cycle:
Every 5 minutes:
1. Drain last 5 minutes of Scheduler + Memory ObservationRings into feature matrix
(per-cgroup stats: avg task latency, cache miss rate, memory pressure, IPC, etc.)
2. If local small model is confident (prediction score > 0.85):
→ Apply parameter updates directly (Tier C path, ~100ms latency)
3. If confidence is low OR first boot OR significant workload shift detected:
→ Serialize feature matrix as JSON/protobuf
→ Call external inference service via UNIX socket or HTTP:
POST /analyze { features: [...], cgroup_ids: [...] }
→ Service runs large model (XGBoost, small transformer, RL policy)
(100ms – 5s latency acceptable at this tier)
→ Parse response: { cgroup_id: 42, param_id: "eevdf_weight_scale", value: 130 }
4. For each (param_id, new_value) in response:
→ Validate cgroup ownership (service can only tune cgroups it owns)
→ Submit PolicyUpdateMsg with valid_for_ms = 300_000 (expires in 5 min)
→ Kernel applies on next parameter read (atomic store, ~5 cycles)
5. Log all updates to FMA audit ring ([Section 20.1](20-observability.md#fault-management-architecture))
→ {timestamp, service_id, model_version, param_id, old_value, new_value}
The external service can be:
- A local Python/Rust process running PyTorch, XGBoost, or scikit-learn
- A remote inference microservice (gRPC or HTTP) running on another node
- An LLM with a structured output schema (for workload characterization and root-cause analysis)
- A reinforcement learning agent maintaining state across tuning cycles
None of this requires any kernel changes: the kernel only sees PolicyUpdateMsg entries.
23.1.7 Model Weight Update Flow¶
When a Tier 2 service trains or fine-tunes an in-kernel model (Section 22.6), it
ships updated weights via the existing sysfs interface (Section 22.6). The update
is atomic from the kernel's perspective:
Tier 2 online learning loop:
Every N minutes (configurable per service):
1. Extract recent training data from ObservationRings + ground-truth outcomes
(ground truth: observed page refault rates for prefetch model; actual I/O
latencies for I/O scheduler model)
2. Run mini-batch update in Tier 2 userspace (full FP, no kernel restrictions)
3. Quantize new weights to INT8/INT16 using the .umkaml binary format ([Section 22.6](22-accelerators.md#in-kernel-inference-engine--umkaml-binary-format))
4. Validate model offline:
- Pass the model through the load-time validator ([Section 22.6](22-accelerators.md#in-kernel-inference-engine--umkaml-binary-format))
- Run 1000 representative inputs; compare against previous model
- Accept only if accuracy delta > -2% (do not regress more than 2 points)
5. Write to /sys/kernel/umka/inference/models/<model_name>/model.bin
→ Kernel receives write, invokes load-time validator ([Section 22.6](22-accelerators.md#in-kernel-inference-engine--umkaml-binary-format))
→ On validation pass: CAS swap of AtomicModelRef pointer
→ Old model freed after RCU grace period
→ New weights active for next inference call (within microseconds)
6. If validation fails: keep previous model, log failure to FMA ring
KernelModel — the canonical in-kernel inference model representation is defined
in Section 22.6 (KernelModel struct). The fields relevant to
the policy framework's quantized linear inference path are documented here for
context; the engine definition is authoritative.
// umka-core/src/inference/model.rs — see [Section 22.6](22-accelerators.md#in-kernel-inference-engine) for
// the canonical KernelModel struct and KernelModelType enum.
//
// KernelModelType has four variants: DecisionTree, LookupTable, LinearModel,
// TinyNeuralNet. The full `run_inference()` dispatch and per-variant algorithms
// are defined in [Section 22.6](22-accelerators.md#in-kernel-inference-engine). This section documents only
// the quantized linear inference path used by the policy framework.
/// Maximum input feature dimension (enforced at model load time).
/// With output_dim ≤ 16, worst case is 64 * 16 = 1024 MACs = <1μs on modern CPUs.
pub const MODEL_MAX_INPUT_DIM: u16 = 64;
/// Maximum output dimension (enforced at model load time).
pub const MODEL_MAX_OUTPUT_DIM: u16 = 16;
// The following is the LinearModel-specific inference path (called by the
// canonical `KernelModel::run_inference()` when `model_type == LinearModel`).
// It is shown here for context because the ML policy framework primarily uses
// LinearModel for its low-overhead quantized inference.
impl KernelModel {
/// Quantized linear inference: dot product of features with weight matrix + bias.
///
/// **Called only when `self.model_type == KernelModelType::LinearModel`.**
/// For other model types, see [Section 22.6](22-accelerators.md#in-kernel-inference-engine) for `run_decision_tree()`,
/// `run_lookup_table()`, and `run_neural_net()`.
///
/// Weight layout in `self.params`:
/// - Bytes `[0 .. input_features * outputs)`: weight matrix W (row-major, i8 quantized).
/// Row `j` (output j) spans bytes `[j * input_features .. (j+1) * input_features)`.
/// - Bytes `[input_features * outputs .. input_features * outputs + outputs * 4)`:
/// bias vector B (each bias is i32, little-endian).
///
/// For each output j: `out[j] = sum(features[i] * W[j][i] for i in 0..input_features) + B[j]`
///
/// Returns output in the caller-provided `output` slice:
/// - If `outputs == 1`: a single scalar result.
/// - If `outputs > 1`: the full output vector. The caller determines interpretation
/// based on the policy context (argmax for classification, first value for regression).
///
/// # Performance bound
///
/// The inference loop is bounded by `input_features * outputs` multiply-accumulate
/// operations. With `MODEL_MAX_INPUT_DIM` (64) and `MODEL_MAX_OUTPUT_DIM` (16),
/// worst case is 1024 MACs = <1μs on modern CPUs. No heap allocation, no floating
/// point — integer-only quantized inference.
fn run_linear_model(&self, input: &[i32], output: &mut [i32]) {
let in_dim = self.input_features as usize;
let out_dim = self.outputs as usize;
let weight_bytes = in_dim * out_dim;
let bias_offset = weight_bytes;
for j in 0..out_dim {
let row_start = j * in_dim;
let mut acc: i32 = 0;
for i in 0..in_dim {
let w = self.params[row_start + i] as i8;
acc = acc.wrapping_add(input[i].wrapping_mul(w as i32));
}
let b_off = bias_offset + j * 4;
let bias = i32::from_le_bytes([
self.params[b_off],
self.params[b_off + 1],
self.params[b_off + 2],
self.params[b_off + 3],
]);
output[j] = acc.wrapping_add(bias);
}
}
}
pub enum ModelError {
/// Input feature vector dimension does not match model.input_features.
InvalidDimensions,
/// Model weights failed integrity check (CRC mismatch).
WeightCorruption,
/// Inference produced out-of-range output.
InferenceOverflow,
}
AtomicModelRef — the kernel's handle on the active in-kernel model:
// umka-core/src/inference/model.rs (continued)
/// RCU-protected reference to the active in-kernel model.
/// Replaced atomically on weight update; old model freed after grace period.
pub struct AtomicModelRef {
pub ptr: AtomicPtr<KernelModel>, // Null = use heuristic fallback
}
impl AtomicModelRef {
/// Hot-path inference: load model pointer, run inference, drop RCU guard.
/// The rcu_read_guard prevents concurrent model replacement from freeing
/// the model while inference is in progress.
///
/// Calls `KernelModel::run_inference()` (see [Section 22.6](22-accelerators.md#in-kernel-inference-engine)),
/// which dispatches to the appropriate algorithm based on `model_type`:
/// `DecisionTree`, `LookupTable`, `LinearModel`, or `TinyNeuralNet`.
pub fn infer(&self, features: &[i32], output: &mut [i32]) -> Option<InferenceResult> {
let _guard = rcu_read_lock();
let model = unsafe { self.ptr.load(Acquire).as_ref()? };
if !model.active.load(Relaxed) { return None; }
let mut yielded = false;
Some(model.run_inference(features, output, &mut yielded))
}
/// Called from sysfs write handler on model.bin write.
/// Validates, then atomically replaces the active model.
///
/// **Serialization**: Callers must hold `MODEL_UPDATE_LOCK` (a global
/// `Mutex<()>` in the sysfs write handler) to prevent concurrent
/// `update()` calls from racing. Without the lock, two concurrent
/// writers could both validate, both `swap()`, and the second swap's
/// "old" pointer would be the first swap's "new" model — which gets
/// freed via `rcu_call_box_drop`, causing the winning model to be
/// freed while still active after the RCU grace period of the
/// losing update.
pub fn update(&self, new_model: Box<KernelModel>) -> Result<(), ModelError> {
// Caller must hold MODEL_UPDATE_LOCK.
new_model.validate()?;
let new_ptr = Box::into_raw(new_model);
let old_ptr = self.ptr.swap(new_ptr, AcqRel);
if !old_ptr.is_null() {
// Free old model after all CPUs pass through a quiescent state.
// Grace period bounds: see [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths).
// Typical grace period: <10ms under normal load; bounded by the
// longest RCU read-side critical section on any CPU.
// ([Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths)).
// SAFETY: old_ptr was allocated via Box::into_raw above.
unsafe { rcu_call_box_drop(old_ptr) };
}
Ok(())
}
}
23.1.8 Security and Capability Model¶
CAP_ML_TUNE capability. Applying PolicyUpdateMsg entries to the kernel requires
the Capability::KernelMlTune capability. This is an UmkaOS-specific capability (not a
Linux CAP_*), held by the Tier 2 service process when:
1. The operator has granted it at service registration time, OR
2. The service is one of UmkaOS's reference policy services and is cryptographically signed
Grant mechanism: KernelMlTune is granted to a Tier 2 service via
/etc/umka/ml-policy.d/<service>.toml:
[service.sched-advisor]
binary = "/usr/lib/umka/ml/sched-advisor"
signing_key_id = "sha3-256:abcd..."
grant_capabilities = ["KernelMlTune"]
cgroup_scope = "/system.slice" # limits parameter visibility
If the service binary's signature verifies against signing_key_id and the key is
present in the .kabi keyring (Section 12.7), the capability is minted and placed in the
service's capability space with a scope constraint limiting it to cgroup_scope.
Without a matching config entry, the capability is not granted and the service cannot
write parameters.
Without KernelMlTune, a process can read observations from its own cgroup but cannot
write parameter updates.
Bounds enforcement. Every new_value in PolicyUpdateMsg is silently clamped to
[min_value, max_value] before being stored. An ML service that produces out-of-range
values is not rejected — the clamping is the safety mechanism.
Namespace isolation. A containerized Tier 2 service sees only its own cgroup's
parameters and observations. A service in cgroup docker/myapp cannot read memory
pressure observations from system.slice, nor can it set rapl_package_power_w globally.
Global parameter updates require CAP_ML_TUNE + CAP_SYS_ADMIN.
Audit log. Every parameter update is logged to the FMA ring (Section 20.1) with:
- {ts_ns, service_id, model_version, param_id, cgroup_id, old_value, new_value}
- Log entries are write-once; tampering with the audit ring requires CAP_SYS_ADMIN
- The FMA ring is accessible to security monitoring tools via umkafs /ukfs/kernel/mlaudit
Adversarial protection:
- Rate limiting: max 1000 PolicyUpdateMsg entries per service per second
- Consistency bounds: if a parameter oscillates by > 50% within 10s, an FMA alert
is raised (may indicate a misbehaving service or adversarial input to the ML model)
- Model versioning: models are refused if their validation accuracy is below 60% on
the standard benchmark set embedded in the kernel binary at build time.
Benchmark set definition (per model type):
| Model Type | Benchmark Inputs | Expected Output | Metric |
|---|---|---|---|
| sched (task placement) | 1000 synthetic task arrival patterns (short-burst, sustained-CPU, IO-interleaved, mixed) with known-optimal CPU assignments from offline solver | Optimal CPU ID per task | Accuracy = correct placements / total |
| memory (page tiering) | 100 memory access traces (sequential scan, random, hot-set, phase-change, zipfian) with known-optimal tier assignments | Tier ID per page | Accuracy = correct tier decisions / total |
| tcp (congestion) | 200 flow traces (datacenter, WAN-lossy, satellite, wifi-variable) with known-optimal cwnd sequences from NS-3 simulation | cwnd within 20% of optimal | Accuracy = within-threshold predictions / total |
| power (DVFS) | 50 CPU utilization traces with known Pareto-optimal frequency/power points | Frequency setting within 10% of optimal | Accuracy = within-threshold selections / total |
| anomaly (FMA) | 100 metric streams (50 normal, 50 with injected anomalies: step change, drift, spike, flatline, periodic) | Correct anomaly/normal classification | F1 score ≥ 0.60 |
Benchmark inputs are compiled into the kernel as const byte arrays in
umka-core/src/ml/benchmark/ (one module per model type). Total
embedded data: ~2 MiB compressed with LZ4 (block format, not frame format).
LZ4 is chosen for its decompression speed (~4 GB/s on modern CPUs) and
minimal code footprint (~500 lines, no heap allocation required for
decompression). The kernel already includes LZ4 for zram and squashfs support.
Decompression: Benchmark data is decompressed lazily on first model validation (not at boot time). Each model type's benchmark module contains:
/// Compressed benchmark inputs for the scheduler model.
/// LZ4 block format, decompressed size: SCHED_BENCH_DECOMPRESSED_SIZE.
static SCHED_BENCH_COMPRESSED: &[u8] = include_bytes!("sched_benchmark.lz4");
const SCHED_BENCH_DECOMPRESSED_SIZE: usize = 524_288; // 512 KiB
The kernel decompresses into a transient buffer sized from the declared
constant (SCHED_BENCH_DECOMPRESSED_SIZE, 512 KiB for the
sched benchmark with 1000 task patterns). Total peak memory during
validation: ~512 KiB (only one model type is validated at a time).
The buffer is freed immediately after validation completes.
The validation function runs the candidate model against each benchmark
input and computes the accuracy metric. Models failing the 60% threshold
are rejected with PolicyError::ValidationFailed.
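The accept/reject decision reduces to a threshold on per-input correctness; a minimal sketch (`passes_benchmark` is a hypothetical helper illustrating the 60% rule, not the kernel's validation entry point):

```rust
/// Accuracy gate for candidate model validation: run the model over each
/// benchmark input, count matches against expected outputs, and pass only
/// at >= 60% accuracy. A failing model maps to PolicyError::ValidationFailed.
fn passes_benchmark<F: Fn(&[i32]) -> i32>(
    model: F,
    inputs: &[Vec<i32>],
    expected: &[i32],
) -> bool {
    let mut correct = 0;
    for (x, want) in inputs.iter().zip(expected.iter()) {
        if model(x) == *want {
            correct += 1;
        }
    }
    // Integer comparison avoids floating point: accuracy >= 60%.
    correct * 100 >= inputs.len() * 60
}
```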
23.1.8.1 Rate Limiter Implementation¶
/// Errors returned by the ML policy subsystem to Tier 2 services.
pub enum PolicyError {
/// Rate limiter rejected the message (too many submissions per second).
RateLimitExceeded,
/// Caller lacks `Capability::KernelMlTune`.
PermissionDenied,
/// Parameter ID not found in `KernelParamStore`.
UnknownParam { param_id: u32 },
// Values outside [min, max] are silently clamped (Section 23.1.8). No error is returned.
/// Service registration failed (duplicate service ID or table full).
RegistrationFailed,
/// Model validation failed (accuracy below 60% threshold or weight corruption).
ValidationFailed,
/// HMAC verification failed on a PolicyUpdateMsg.
HmacInvalid,
}
/// Registration record for a Tier 2 policy service.
/// Created during `ML_POLICY_REGISTER` ioctl. Holds the rate limiter
/// and a reference to the service's consumer vtable.
// Kernel-internal registration record. Not #[repr(C)], not KABI --
// layout is compiler-determined.
pub struct PolicyServiceRegistration {
/// Unique service identifier assigned at registration.
pub service_id: u32,
/// Token bucket rate limiter (one per service).
pub rate_limiter: PolicyServiceRateLimiter,
/// File descriptors for the service's shared-memory ring protocol.
pub obs_ring_fd: i32,
pub update_ring_fd: i32,
pub param_store_fd: i32,
/// PID of the registering process (for audit logging).
pub owner_pid: u32,
/// Per-registration HMAC session key (HMAC-SHA3-256). Generated by the
/// kernel during `ML_POLICY_REGISTER` and returned to the service in
/// `PolicyRegisterResponse.session_key`. The kernel retains this copy
/// to verify the HMAC on every incoming `PolicyUpdateMsg`.
pub session_key: [u8; 32],
}
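The per-message check that this session key enables can be sketched as follows. Real HMAC-SHA3-256 needs a crypto implementation, so `toy_mac` below is a loudly labeled placeholder; only the flow — recompute the tag over the message bytes with the kernel's retained session key, compare in constant time, reject on mismatch — reflects the text.

```rust
// Stand-in error type for this sketch (mirrors PolicyError::HmacInvalid).
#[derive(Debug, PartialEq)]
pub enum MacError {
    HmacInvalid,
}

/// Constant-time comparison: XOR-accumulate all 32 bytes so timing does
/// not leak the position of the first mismatching byte.
fn ct_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}

/// TOY keyed MAC -- NOT HMAC-SHA3-256. Placeholder so the flow is runnable.
fn toy_mac(key: &[u8; 32], msg: &[u8]) -> [u8; 32] {
    let mut tag = *key;
    for (i, b) in msg.iter().enumerate() {
        tag[i % 32] = tag[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    tag
}

/// Verify the tag carried by an incoming PolicyUpdateMsg against the
/// kernel's retained copy of the registration's session key.
pub fn verify_update_msg(
    session_key: &[u8; 32],
    msg_bytes: &[u8],
    claimed_tag: &[u8; 32],
) -> Result<(), MacError> {
    let expected = toy_mac(session_key, msg_bytes);
    if ct_eq(&expected, claimed_tag) {
        Ok(())
    } else {
        Err(MacError::HmacInvalid)
    }
}
```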
Each registered Tier 2 service has a PolicyServiceRateLimiter embedded in its
PolicyServiceRegistration record. Enforcement happens before any validation or
dispatch of the PolicyUpdateMsg:
/// Token bucket rate limiter for PolicyUpdateMsg.
/// One per registered Tier 2 service. Cache-line aligned to prevent
/// false sharing when multiple services submit concurrently.
///
/// Uses a fractional nanosecond accumulator to avoid integer division
/// precision loss. The accumulator tracks elapsed nanoseconds that have
/// not yet been converted into tokens. When enough nanoseconds accumulate
/// to produce at least one whole token, they are converted and the
/// remainder is preserved. This gives exact long-term rates for any
/// tokens/sec value without requiring floating point or large intermediate
/// products.
#[repr(C, align(64))]
pub struct PolicyServiceRateLimiter {
/// Current token count (whole tokens, no scaling factor).
pub tokens: AtomicU32,
/// Explicit padding for AtomicU64 alignment (offset 4 -> 8).
pub _pad0: u32,
/// Nanosecond timestamp of last refill attempt.
pub last_refill_ns: AtomicU64,
/// Fractional nanosecond remainder from the last refill that was not
/// enough to produce a whole token. Accumulates across refill calls
/// so that sub-token intervals are never lost.
pub remainder_ns: AtomicU64,
/// Maximum token count (= burst capacity).
pub max_tokens: u32,
/// Explicit padding for u64 alignment (offset 28 -> 32).
pub _pad1: u32,
/// Nanoseconds required to produce one token (= 1_000_000_000 / rate).
/// Stored as the divisor to avoid per-refill division: we divide
/// elapsed_ns by this constant.
pub ns_per_token: u64,
// Total: 40 bytes of fields + 24 bytes align(64) tail padding = 64 bytes.
}
const_assert!(core::mem::size_of::<PolicyServiceRateLimiter>() == 64);
pub const POLICY_MSG_RATE_LIMIT_PER_SEC: u64 = 1_000; // steady-state limit
pub const POLICY_MSG_BURST: u32 = 200; // burst capacity
impl PolicyServiceRateLimiter {
pub const fn new() -> Self {
Self {
tokens: AtomicU32::new(POLICY_MSG_BURST),
_pad0: 0,
last_refill_ns: AtomicU64::new(0),
remainder_ns: AtomicU64::new(0),
max_tokens: POLICY_MSG_BURST,
_pad1: 0,
// 1_000_000_000 ns/sec / 1_000 tokens/sec = 1_000_000 ns per token.
// For any rate R tokens/sec, ns_per_token = 1_000_000_000 / R.
// This is exact for all rates that evenly divide 10^9 and provides
// < 1 ns/token error otherwise (absorbed by the remainder accumulator).
ns_per_token: 1_000_000_000 / POLICY_MSG_RATE_LIMIT_PER_SEC,
}
}
/// Attempt to consume one token. Lock-free.
///
/// Timestamp and remainder fields use Relaxed ordering: the token bucket
/// provides approximate rate limiting with no inter-CPU visibility
/// guarantees beyond the CAS. The token counter itself uses CAS loops
/// with Acquire/Release ordering for both refill and consume paths to
/// prevent wrapping underflow (which would completely bypass the rate
/// limiter).
pub fn try_consume(&self) -> Result<(), PolicyError> {
let now_ns = crate::arch::current::cpu::read_monotonic_ns();
// Refill: compute elapsed time, add remainder, convert to whole tokens.
// remainder_ns is best-effort; concurrent refills may grant OR lose
// O(1) tokens due to non-atomic read-modify-write on the remainder.
// This is acceptable: the rate limiter is advisory (prevents ML service
// flooding, not a security boundary). Token loss is bounded at 1 token
// per concurrent refill race.
let last = self.last_refill_ns.load(Relaxed);
let elapsed_ns = now_ns.saturating_sub(last);
let prev_remainder = self.remainder_ns.load(Relaxed);
let total_ns = elapsed_ns.saturating_add(prev_remainder);
let new_tokens = total_ns / self.ns_per_token;
if new_tokens > 0 {
let leftover_ns = total_ns % self.ns_per_token;
// The CAS on last_refill_ns serializes refill intervals.
// A concurrent caller with a slightly later timestamp may win
// a subsequent CAS, producing a second refill for the delta
// interval. This adds at most 1 extra token (delta <
// ns_per_token), bounded by max_tokens.
if self.last_refill_ns.compare_exchange(
last, now_ns, Relaxed, Relaxed
).is_ok() {
self.remainder_ns.store(leftover_ns, Relaxed);
// CAS loop for refill: cap at max_tokens, no lost tokens. The grant
// is capped before the u32 cast so a long idle gap (new_tokens can
// exceed u32::MAX) cannot truncate, and the add saturates so it
// cannot wrap.
let grant = new_tokens.min(self.max_tokens as u64) as u32;
loop {
let cur = self.tokens.load(Relaxed);
let new = cur.saturating_add(grant).min(self.max_tokens);
match self.tokens.compare_exchange_weak(cur, new, Release, Relaxed) {
Ok(_) => break,
Err(_) => continue,
}
}
}
}
// Consume one token via CAS loop — prevents wrapping underflow.
loop {
let cur = self.tokens.load(Acquire);
if cur == 0 {
return Err(PolicyError::RateLimitExceeded);
}
match self.tokens.compare_exchange_weak(cur, cur - 1, AcqRel, Relaxed) {
Ok(_) => return Ok(()),
Err(_) => continue, // Another thread consumed; retry
}
}
}
}
Enforcement point — the first check in policy_consumer_kabi_submit():
pub fn policy_consumer_kabi_submit(
svc: &PolicyServiceRegistration,
msg: PolicyUpdateMsg,
) -> Result<(), PolicyError> {
svc.rate_limiter.try_consume()?; // Reject immediately if over limit
validate_policy_update_msg(&msg)?; // Then validate
dispatch_to_subsystem(svc, msg) // Then dispatch
}
dispatch_to_subsystem definition:
/// Route a validated policy update to the target subsystem's parameter store.
/// The subsystem is identified by the `ParamId` encoding in `msg.param_id`
/// (bits [11:8] = subsystem index). Each subsystem registers a static
/// `PolicyApplyFn` at boot; this function looks it up and calls it.
///
/// Returns `Err(PolicyError::UnknownParam)` if the subsystem index
/// is out of range or has no registered handler.
fn dispatch_to_subsystem(
svc: &PolicyServiceRegistration,
msg: PolicyUpdateMsg,
) -> Result<(), PolicyError> {
// Convert raw u32 wire value to typed ParamId enum.
let param_id = ParamId::try_from_u32(msg.param_id)
.ok_or(PolicyError::UnknownParam { param_id: msg.param_id })?;
let subsystem = SubsystemId::from_param_id(param_id)
.ok_or(PolicyError::UnknownParam { param_id: msg.param_id })?;
let idx = subsystem as usize;
// POLICY_HANDLERS is a static array of Option<PolicyApplyFn>, one per
// SubsystemId, populated at boot by each subsystem's init code.
// Read via Acquire to see the handler written by the registering CPU.
let handler = POLICY_HANDLERS[idx]
.load(Ordering::Acquire)
.ok_or(PolicyError::UnknownParam { param_id: msg.param_id })?;
// Clamp the value to the parameter's registered [min, max] bounds.
let bounds = KERNEL_PARAM_STORE.bounds(param_id);
let clamped_value = msg.new_value.clamp(bounds.min, bounds.max);
// Apply the clamped value. The handler writes to the subsystem's
// per-cgroup or global parameter store (AtomicU32/AtomicI32).
handler(param_id, clamped_value, msg.cgroup_id)
}
/// Function pointer type for subsystem parameter application.
/// Called on the warm path (per policy update, not per syscall).
type PolicyApplyFn = fn(param: ParamId, value: i64, cgroup_id: u64) -> Result<(), PolicyError>;
/// Per-subsystem handler table. Indexed by `SubsystemId as usize`.
/// Each entry is set once at boot by the subsystem's init code.
///
/// **Implementation note**: `AtomicOption<fn(...)>` is implemented as
/// `AtomicPtr<()>` with null representing `None`. Function pointers
/// in Rust are pointer-sized thin pointers, so `AtomicPtr` can hold
/// them. The `load()` method returns `Option<fn(...)>` by converting
/// null → None, non-null → Some(transmute). This is sound because
/// all non-null values were stored from valid function pointers.
static POLICY_HANDLERS: [AtomicOption<PolicyApplyFn>; MAX_SUBSYSTEM_ID] =
[const { AtomicOption::none() }; MAX_SUBSYSTEM_ID];
/// Maximum SubsystemId discriminant + 1.
const MAX_SUBSYSTEM_ID: usize = 16;
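A subsystem's boot-time registration into this table can be sketched with the `AtomicPtr`-based encoding described in the implementation note. The names below (`AtomicOptionFn`, `sched_apply`, `sched_policy_init`) are illustrative assumptions, not the kernel's actual identifiers:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Simplified handler signature for this sketch.
type ApplyFn = fn(param_id: u32, value: i64) -> Result<(), ()>;

/// Minimal stand-in for AtomicOption<PolicyApplyFn>: an AtomicPtr<()>
/// where null represents None.
struct AtomicOptionFn(AtomicPtr<()>);

impl AtomicOptionFn {
    const fn none() -> Self {
        Self(AtomicPtr::new(std::ptr::null_mut()))
    }
    /// Publish a handler. Release pairs with the Acquire in load().
    fn store(&self, f: ApplyFn) {
        self.0.store(f as *mut (), Ordering::Release);
    }
    fn load(&self) -> Option<ApplyFn> {
        let p = self.0.load(Ordering::Acquire);
        if p.is_null() {
            None
        } else {
            // Sound: every non-null value was stored from a valid ApplyFn.
            Some(unsafe { std::mem::transmute::<*mut (), ApplyFn>(p) })
        }
    }
}

const MAX_SUBSYSTEMS: usize = 16;
static HANDLERS: [AtomicOptionFn; MAX_SUBSYSTEMS] =
    [const { AtomicOptionFn::none() }; MAX_SUBSYSTEMS];

/// The scheduler's apply function: writes the clamped value into its
/// parameter store (elided here).
fn sched_apply(_param_id: u32, _value: i64) -> Result<(), ()> {
    Ok(())
}

/// Called once from the scheduler's boot-time init code.
fn sched_policy_init() {
    HANDLERS[0].store(sched_apply); // index 0 = scheduler subsystem (assumed)
}
```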
Rejection behavior: The Tier 2 service receives PolicyError::RateLimitExceeded
(mapped to -EBUSY in the SysAPI layer). The service is responsible for backing off
(exponential backoff recommended). Repeated violations — more than 3 rate-limit errors
per second — trigger an FMA alert (category: PolicyConsumerMisbehaving, service_id).
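A minimal sketch of the recommended backoff on the service side; the base, cap, and function name are illustrative choices, not mandated by the kernel:

```rust
const BACKOFF_BASE_MS: u64 = 1;
const BACKOFF_MAX_MS: u64 = 1_000;

/// Delay before retrying after the Nth consecutive RateLimitExceeded
/// (-EBUSY): 1 ms, 2 ms, 4 ms, ... doubling up to a 1 s cap. The caller
/// resets the rejection count to zero on any successful submission.
fn backoff_delay_ms(consecutive_rejections: u32) -> u64 {
    // Clamp the shift so the multiply cannot overflow for large counts.
    BACKOFF_BASE_MS
        .saturating_mul(1u64 << consecutive_rejections.min(10))
        .min(BACKOFF_MAX_MS)
}
```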
23.1.9 Reference Policy Services¶
UmkaOS ships the following Tier 2 policy services (all optional, loaded on demand):
| Service | Model | Parameters tuned | Observations consumed | Cadence |
|---|---|---|---|---|
| umka-ml-sched | Gradient-boosted trees (XGBoost) | eevdf_weight_scale, migration_benefit_threshold, eas_energy_bias, resched_urgency | TaskWoke, RunqueueStats, EasDecision | Every 60s |
| umka-ml-numa | Gradient-boosted regression | numa_migration_threshold, swap_local_ratio | NumaMigration, RefaultRecord, MemPressure | Every 30s |
| umka-ml-compress | Random forest | compress_entropy_threshold, reclaim_aggressiveness | PageFault, EvictionDecision, MemPressure | Every 30s |
| umka-ml-power | Contextual bandit (online RL) | rapl_package_power_w, accel_power_cap_w, thermal_target_c | EAS observations, accel utilization | Every 10s |
| umka-ml-anomaly | Isolation forest | anomaly_alert_threshold, health_check_interval_ms | FmaHealth device metrics | Every 5s |
| umka-ml-tcp | Online linear model | tcp_initial_cwnd_scale, bbr_probe_rtt_interval_ms | CongestionEvent, FlowStats | Every 5s |
| umka-ml-intent | RL / gradient-boosted | SchedIntentKpLatency, SchedIntentKdLatency, SchedIntentMaxAdjustment, SchedIntentHoldTimeMs | RunqueueStats, IntentStatus, LatencyHistogram | Every 60s |
Each service is a standalone Tier 2 driver (~500-2000 LOC) implementing the
policy ring protocol (PolicyRegisterRequest/PolicyRegisterResponse). They are independently upgradable, independently crashable
(Tier 2 restart in ~10ms, Section 11.2), and independently optional.
The umka-ml-sched service supports a "big model" extension point: if the environment
variable UMKA_ML_SCHED_REMOTE_ENDPOINT is set, it will call an external inference
service (Section 23.1.6) for low-confidence workload characterizations, falling back
to its local XGBoost model when the remote endpoint is unavailable.
23.1.10 Performance Impact¶
Observation emission (when at least one consumer is registered):
| Operation | Cycles | Notes |
|---|---|---|
| Static key check (disabled) | 1–3 | Predicted-not-taken branch |
| TSC read | 5–10 | RDTSC or arch-specific |
| Ring buffer write | 10–20 | One cache line write |
| Total (enabled) | ~25 cycles | ~10ns at 2.5 GHz |
Parameter read (hot path):
| Operation | Cycles | Notes |
|---|---|---|
| AtomicI64::load(Relaxed) | 1–3 | Same cost as reading a global variable |
Parameter update (Tier 2 → kernel):
| Operation | Cycles | Notes |
|---|---|---|
| Ring buffer submission | ~20 | From Tier 2 userspace |
| Kernel validation + CAS | ~15 | Including capability check |
| Total end-to-end (async) | ~100μs | Dominated by scheduling latency for async path |
Decay enforcement (once per second, not per tick):
The enforce_param_decay scan visits all registered parameters once per second (guarded
by the decay_due flag set by a 1 Hz periodic timer). With 50 parameters: ~50 comparisons
per second per CPU. At 3 cycles per comparison: ~150 cycles/s = ~60ns/s. Negligible.
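The 1 Hz scan can be sketched as below. `ParamEntry` and its field names are assumptions standing in for the registered-parameter records in KernelParamStore; the revert-to-default behavior follows the temporal decay rule in Section 23.1.1.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering::Relaxed};

/// Stand-in for one registered parameter record.
struct ParamEntry {
    /// Live value read on the hot path.
    value: AtomicI64,
    /// Heuristic default the value decays back to.
    default_value: i64,
    /// Timestamp (ms) of the last ML refresh; 0 = not currently ML-set.
    last_refresh_ms: AtomicU64,
    /// Decay window for this parameter.
    decay_period_ms: u64,
}

/// Once-per-second scan: any ML-set parameter whose last refresh is
/// older than decay_period_ms reverts to its heuristic default.
fn enforce_param_decay(params: &[ParamEntry], now_ms: u64) {
    for p in params {
        let last = p.last_refresh_ms.load(Relaxed);
        // Only ML-touched parameters (last != 0) are subject to decay.
        if last != 0 && now_ms.saturating_sub(last) >= p.decay_period_ms {
            p.value.store(p.default_value, Relaxed);
            p.last_refresh_ms.store(0, Relaxed); // back to heuristic mode
        }
    }
}
```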