Sections 39–41 of the ISLE Architecture. For the full table of contents, see README.md.


Part X: Observability and Diagnostics

Fault management, tracepoints, and the unified object namespace.


39. Fault Management Architecture

Inspired by: Solaris/illumos FMA. IP status: Clean — generic engineering concepts, clean-room implementation from public principles.

39.1 Problem

Hardware fails gradually. ECC memory corrects bit flips before they become fatal. NVMe drives report wear leveling and reallocated sectors. PCIe links log correctable errors. Network interfaces track CRC errors. These are warning signals.

Linux largely ignores them. Userspace tools (mcelog, smartctl, rasdaemon) scrape ad-hoc kernel interfaces. There is no unified kernel-level framework for collecting hardware telemetry, diagnosing trends, and taking corrective action before a crash.

ISLE already has crash recovery (Section 9). FMA extends this from reactive ("driver crashed, reload it") to proactive ("this DIMM is degrading, retire its pages before data corruption occurs").

39.2 Architecture

+------------------------------------------------------------------+
|                     FAULT MANAGEMENT ENGINE                      |
|                     (ISLE Core, kernel-internal)                 |
|                                                                  |
|  Telemetry        Diagnosis         Response                     |
|  Collector  --->  Engine      --->  Executor                     |
|                                                                  |
|  - Per-device     - Rule-based      - Retire memory pages        |
|    health buffers - Threshold       - Demote driver tier         |
|  - Ring buffer      detection       - Disable device             |
|    (lock-free)    - Correlation     - Alert via printk/uevent    |
|  - NUMA-aware       (multi-signal)  - Migrate I/O (if possible)  |
+------------------------------------------------------------------+
         ^                                       |
         |  Health reports via KABI              |  Actions via registry
         |                                       v
+------------------+                   +-------------------+
| Device Drivers   |                   | Device Registry   |
| (Tier 0/1/2)     |                   | (Section 7)       |
+------------------+                   +-------------------+

39.3 Telemetry Collection

Drivers report health data to the kernel through a new KABI method:

// Appended to KernelServicesVTable (Option<...> for backward compat)

/// Report a health telemetry event.
pub fma_report_health: Option<unsafe extern "C" fn(
    device_handle: DeviceHandle,
    event_class: HealthEventClass,
    event_code: u32,
    severity: HealthSeverity,
    data: *const u8,
    data_len: u32,
) -> IoResultCode>,

/// Health event classification.
#[repr(u32)]
pub enum HealthEventClass {
    /// Memory: ECC corrected error, uncorrectable error, scrub result.
    Memory          = 0,
    /// Storage: SMART attribute change, wear level, reallocated sector.
    Storage         = 1,
    /// Network: CRC errors, link flaps, packet drops.
    Network         = 2,
    /// PCIe: Correctable error, AER event, link retraining.
    Pcie            = 3,
    /// Thermal: Over-temperature warning, throttling.
    Thermal         = 4,
    /// Power: Voltage out of range, power supply degradation.
    Power           = 5,
    /// Generic: Driver-defined health event.
    Generic         = 6,
    /// Accelerator: GPU, FPGA, or other accelerator health event.
    Accelerator     = 7,
}

impl HealthEventClass {
    /// Number of variants in this enum.  Used to size arrays indexed by
    /// event class, avoiding hardcoded constants.
    pub const COUNT: usize = 8;
}

#[repr(u32)]
pub enum HealthSeverity {
    /// Informational: no action needed, for trending.
    Info            = 0,
    /// Warning: threshold approaching, admin should investigate.
    Warning         = 1,
    /// Degraded: component partially failed, corrective action taken.
    Degraded        = 2,
    /// Critical: imminent failure, immediate action required.
    Critical        = 3,
}

The data field carries an event-specific payload. Standard payload formats per class:

/// Memory health payload.
#[repr(C)]
pub struct MemoryHealthData {
    /// Physical address of the error (0 = unknown).
    pub phys_addr: u64,
    /// DIMM identifier (SMBIOS handle or ACPI proximity).
    pub dimm_id: u32,
    /// Error type: 0 = correctable, 1 = uncorrectable.
    pub error_type: u32,
    /// ECC syndrome (for diagnosis).
    pub syndrome: u64,
    /// Cumulative correctable error count for this DIMM.
    pub cumulative_ce_count: u64,
}

/// Storage health payload.
#[repr(C)]
pub struct StorageHealthData {
    /// SMART attribute ID.
    pub attribute_id: u8,
    /// Current value.
    pub current: u8,
    /// Worst recorded value.
    pub worst: u8,
    /// Threshold for failure.
    pub threshold: u8,
    pub _pad_align: [u8; 4], // Explicit padding for u64 alignment (repr(C))
    /// Raw attribute value (vendor-specific).
    pub raw_value: u64,
    /// Percentage life remaining (0-100, 0xFF = unknown).
    pub life_remaining_pct: u8,
    pub _pad: [u8; 7],
}

/// Network health payload.
#[repr(C)]
pub struct NetworkHealthData {
    /// Interface index (matches the device registry's interface ID).
    pub if_index: u32,
    /// CRC error count since last report.
    pub crc_errors: u32,
    /// Link flap count since last report.
    pub link_flaps: u32,
    /// Packet drop count (RX + TX) since last report.
    pub packet_drops: u32,
    /// Current link speed in Mbps (0 = link down).
    pub link_speed_mbps: u32,
    /// Link state: 0 = down, 1 = up.
    pub link_up: u8,
    pub _pad: [u8; 3],
}

/// Thermal health payload.
#[repr(C)]
pub struct ThermalHealthData {
    /// Current temperature in millidegrees Celsius (e.g., 72500 = 72.5 C).
    pub temp_millicelsius: i32,
    /// Thermal throttling threshold in millidegrees Celsius.
    pub throttle_threshold_mc: i32,
    /// Critical shutdown threshold in millidegrees Celsius.
    pub critical_threshold_mc: i32,
    /// Whether the device is currently throttled (0 = no, 1 = yes).
    pub throttled: u8,
    /// Thermal zone identifier (device-specific).
    pub zone_id: u8,
    pub _pad: [u8; 2],
}

/// PCIe health payload.
#[repr(C)]
pub struct PcieHealthData {
    /// BDF (bus:device.function).
    pub bdf: u32,
    /// AER correctable error status register.
    pub cor_status: u32,
    /// AER uncorrectable error status register.
    pub uncor_status: u32,
    /// Current link speed (GT/s * 10, e.g., 80 = 8.0 GT/s).
    pub link_speed: u16,
    /// Current link width (x1, x4, x8, x16).
    pub link_width: u8,
    /// Link retraining count.
    pub retrain_count: u8,
}

39.4 Telemetry Buffer

The FMA engine maintains a per-device circular buffer of health events:

// Kernel-internal

pub struct DeviceHealthLog {
    /// Device node this log belongs to.
    device_id: DeviceNodeId,

    /// Circular buffer of recent events (fixed size per device).
    events: CircularBuffer<HealthEvent, 256>,

    /// Counters by event class (for fast threshold checks).
    /// Size is derived from the enum variant count, not hardcoded.
    class_counts: [AtomicU64; HealthEventClass::COUNT],

    /// Timestamp of first event in current window (for rate detection).
    window_start_ns: u64,

    /// NUMA node (allocate buffer on device's NUMA node).
    numa_node: i32,

    /// Maximum events per second per (device, event_class) pair.
    /// Events exceeding this rate are counted but not stored in the
    /// circular buffer. Default: 100.
    rate_limit_per_sec: u32,

    /// Number of events suppressed by rate limiting since last reset.
    /// When non-zero, a single "rate_limited" meta-event is recorded
    /// in the circular buffer with the suppressed count.
    suppressed_count: AtomicU64,
}

#[repr(C)]
pub struct HealthEvent {
    pub timestamp_ns: u64,
    pub class: HealthEventClass,
    pub code: u32,
    pub severity: HealthSeverity,
    pub data: [u8; 64],    // Inline payload (avoids allocation)
    pub data_len: u32,
}

Backpressure design (prevents event storms from overwhelming the telemetry path):

The ingestion path uses a two-level design:

  • Fast path: Atomic counter increment per (device, event_class) pair. Zero allocation. Every event is counted in class_counts regardless of rate.
  • Slow path: The detailed event is stored in the circular buffer only if the rate is below rate_limit_per_sec (default: 100 events/second per class). Above the threshold, a single "rate_limited" meta-event is recorded with the suppressed_count value, then the counter resets.

This ensures the buffer contains representative events without being flooded during error storms.
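The two-level ingestion path can be sketched under simplifying assumptions: one event class, one-second windows, plain integers in place of the per-CPU atomics, and a counter standing in for the circular buffer. The names `IngestState` and `Stored` are illustrative, not part of the design above.

```rust
// Sketch of the two-level ingestion path (simplified model: one event
// class, 1-second windows, plain counters instead of lock-free buffers).
const RATE_LIMIT_PER_SEC: u64 = 100;

#[derive(Debug, PartialEq)]
pub enum Stored {
    Event,      // detailed event stored in the circular buffer
    Suppressed, // counted only; rate limit exceeded
}

pub struct IngestState {
    pub class_count: u64,      // fast path: always incremented
    pub window_start_ns: u64,  // start of the current 1-second window
    pub stored_in_window: u64, // slow-path stores in the current window
    pub suppressed_count: u64, // events dropped by rate limiting
}

impl IngestState {
    pub fn new() -> Self {
        IngestState {
            class_count: 0,
            window_start_ns: 0,
            stored_in_window: 0,
            suppressed_count: 0,
        }
    }

    /// Ingest one health event at `now_ns`. Returns whether the detailed
    /// event was stored or only counted.
    pub fn ingest(&mut self, now_ns: u64) -> Stored {
        // Fast path: every event is counted, regardless of rate.
        self.class_count += 1;

        // Roll the window when a second has elapsed.
        if now_ns - self.window_start_ns >= 1_000_000_000 {
            self.window_start_ns = now_ns;
            self.stored_in_window = 0;
            // A real implementation would emit a "rate_limited" meta-event
            // here when suppressed_count is non-zero, then reset it.
            self.suppressed_count = 0;
        }

        // Slow path: store only while under the per-window rate limit.
        if self.stored_in_window < RATE_LIMIT_PER_SEC {
            self.stored_in_window += 1;
            Stored::Event
        } else {
            self.suppressed_count += 1;
            Stored::Suppressed
        }
    }
}
```

During an error storm of 150 events in one window, all 150 are counted but only the first 100 are stored; the remaining 50 show up only in the suppressed count.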

39.5 Diagnosis Engine

The diagnosis engine is a rule-based evaluator that runs when new telemetry arrives.

// Kernel-internal

pub struct DiagnosisRule {
    /// Human-readable name (for logging).
    name: ArrayString<64>,

    /// Match criteria.
    event_class: HealthEventClass,
    min_severity: HealthSeverity,

    /// Threshold: number of matching events within time window.
    count_threshold: u32,
    window_seconds: u32,

    /// Rate: events per second sustained over window.
    rate_threshold: Option<u32>,

    /// Value-based threshold: fires when a specific field in the health event payload
    /// falls below (or exceeds) a specified level, independent of event counts.
    ///
    /// For example, NVMe wear rules ("life < 10%") use:
    ///   field_id = DiagField::LifeRemainingPct, threshold = 10
    /// The diagnosis engine reads the field identified by `field_id` from the health
    /// event payload via the `DiagFieldAccessor` trait and fires the rule when the
    /// field's value drops below `threshold`.
    /// When `None`, only `count_threshold` / `rate_threshold` are evaluated.
    value_threshold: Option<ValueThreshold>,

    /// Correlation: require events from multiple related devices.
    correlation: Option<CorrelationRule>,

    /// Action to take when rule fires.
    action: DiagnosisAction,
}

/// Enum-based field identifiers for scalar values in health event payloads.
/// Used instead of string-based field names because Rust has no runtime
/// reflection -- struct fields don't carry name metadata at runtime. Each
/// health event payload type implements `DiagFieldAccessor` to map these
/// enum variants to the actual struct field values.
#[repr(u16)]
pub enum DiagField {
    /// NVMe endurance: remaining drive life as a percentage (0-100).
    LifeRemainingPct    = 0,
    /// Current temperature in millidegrees Celsius.
    Temperature         = 1,
    /// Cumulative correctable error count.
    CorrectableErrors   = 2,
    /// Cumulative uncorrectable error count.
    UncorrectableErrors = 3,
    /// Available spare capacity as a percentage (NVMe).
    AvailableSparePct   = 4,
    /// Power-on hours.
    PowerOnHours        = 5,
    /// Media errors (NVMe).
    MediaErrors         = 6,
    /// Memory correctable error count (DIMM health).
    MemoryCorrectableErrors = 7,
}

/// Trait implemented by each health event payload type (e.g., `StorageHealthData`,
/// `MemoryHealthData`). Maps `DiagField` variants to the corresponding scalar
/// value in the payload struct. Returns `None` if the field is not applicable
/// to this payload type (e.g., `LifeRemainingPct` on a DIMM health event).
pub trait DiagFieldAccessor {
    /// Extract the scalar value for the given field, or `None` if not applicable.
    fn field_value(&self, field: DiagField) -> Option<u64>;
}

/// Threshold applied to a specific scalar field in the health event payload.
/// Used for percentage- or gauge-based rules (e.g., NVMe wear life).
pub struct ValueThreshold {
    /// Identifies which field in the health event payload to compare.
    /// E.g., `DiagField::LifeRemainingPct` for NVMe endurance data.
    field_id: DiagField,

    /// Numeric trigger level. The rule fires when the field's value is
    /// strictly less than this value (e.g., `10` means "< 10%").
    threshold: u32,
}
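A minimal sketch of how `ValueThreshold` and `DiagFieldAccessor` fit together, using cut-down re-declarations of the types above (only the fields needed for the example are kept, and `threshold_fires` is an assumed helper name, not part of the design):

```rust
// Sketch: evaluating a ValueThreshold against a StorageHealthData payload.
#[derive(Clone, Copy, PartialEq)]
pub enum DiagField {
    LifeRemainingPct,
    Temperature,
}

pub trait DiagFieldAccessor {
    fn field_value(&self, field: DiagField) -> Option<u64>;
}

pub struct StorageHealthData {
    pub life_remaining_pct: u8,
}

impl DiagFieldAccessor for StorageHealthData {
    fn field_value(&self, field: DiagField) -> Option<u64> {
        match field {
            DiagField::LifeRemainingPct => Some(self.life_remaining_pct as u64),
            _ => None, // field not applicable to this payload type
        }
    }
}

pub struct ValueThreshold {
    pub field_id: DiagField,
    pub threshold: u32,
}

/// Fires when the identified field is strictly below the threshold.
/// Returns false when the field is not applicable to the payload.
pub fn threshold_fires(t: &ValueThreshold, payload: &dyn DiagFieldAccessor) -> bool {
    match payload.field_value(t.field_id) {
        Some(v) => v < t.threshold as u64,
        None => false,
    }
}
```

The "NVMe wear out" rule from Section 39.5 then becomes `ValueThreshold { field_id: DiagField::LifeRemainingPct, threshold: 10 }`, firing on any storage payload reporting less than 10% life remaining.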

pub enum DiagnosisAction {
    /// Log a message and emit uevent. No automatic correction.
    Alert { message: ArrayString<128> },

    /// Retire specific physical pages (memory errors).
    RetirePages,

    /// Demote the device's driver to a lower tier.
    DemoteTier,

    /// Disable the device entirely.
    DisableDevice,

    /// Mark device as degraded (informational, for admin).
    MarkDegraded,

    /// Trigger live evolution (§52) to proactively replace a degrading component.
    TriggerEvolution { target_component: ArrayString<64> },
}

/// Used by the diagnosis engine (Section 39.5) to detect correlated failures
/// across multiple related devices. When a DiagnosisRule includes a CorrelationRule,
/// the engine evaluates the threshold across all devices sharing the specified
/// property (e.g., same memory controller, same PCIe root complex, same NUMA node).
/// The event correlation engine (Section 41, Health namespace) uses these rules to
/// populate cross-device fault reports visible under /Health/ByDevice/.
pub struct CorrelationRule {
    /// Require events from N distinct devices sharing this property.
    /// Example: "memory_controller" — multiple DIMMs on the same controller,
    /// "pcie_root" — multiple devices on the same PCIe root complex.
    shared_property: ArrayString<32>,
    min_devices: u32,
}

Default rules (built-in; administrators can override them via /proc/isle/fma/rules, Section 39.7):

Rule                 Class    Threshold        Window      Action
DIMM degradation     Memory   100 CE           1 hour      Alert + RetirePages
DIMM failure         Memory   1 UE             instant     DisableDevice + Alert
NVMe wear out        Storage  life < 10%       n/a         Alert
NVMe critical wear   Storage  life < 3%        n/a         MarkDegraded + Alert
PCIe link unstable   PCIe     10 retrains      1 minute    Alert
PCIe link failing    PCIe     50 retrains      1 minute    DemoteTier + Alert
NIC error storm      Network  1000 CRC errors  1 minute    Alert
Thermal throttling   Thermal  5 events         10 minutes  Alert
PCIe proactive swap  PCIe     30 retrains      10 minutes  TriggerEvolution + Alert
NVMe proactive swap  Storage  life < 5%        n/a         TriggerEvolution + Alert
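The count-based rows above reduce to a sliding-window check. A minimal sketch, assuming matching-event timestamps are available as an ascending slice (the function name and slice input are illustrative; the real engine walks the per-device circular buffer):

```rust
// Sketch of count-in-window rule evaluation: a rule such as
// "DIMM degradation: 100 CE in 1 hour" fires when at least
// `count_threshold` matching events fall inside the trailing window.
pub fn count_rule_fires(
    event_times_ns: &[u64], // timestamps of matching events, ascending
    now_ns: u64,
    count_threshold: u32,
    window_seconds: u32,
) -> bool {
    let window_ns = window_seconds as u64 * 1_000_000_000;
    let cutoff = now_ns.saturating_sub(window_ns);
    // Count events inside the trailing window [cutoff, now].
    let in_window = event_times_ns.iter().filter(|&&t| t >= cutoff).count();
    in_window as u32 >= count_threshold
}
```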

39.6 Response Executor

Actions integrate with existing kernel subsystems:

RetirePages: The memory manager (Section 12) is asked to remove specific physical pages from the buddy allocator. Any process mapping those pages is transparently migrated to a replacement page (copy-on-read). This is how Linux handles hardware-poisoned pages (memory_failure()), but triggered proactively.

DemoteTier: The device registry (Section 7) transitions the device's driver to a lower isolation tier. This uses the existing crash-recovery reload mechanism but without a crash — clean stop, change tier, restart.

DisableDevice: The registry transitions the device to Error state. The driver is stopped. The device node remains in the tree (for introspection) but accepts no I/O.

TriggerEvolution: The response executor invokes the Live Evolution framework (Section 52) to proactively hot-swap a degrading component before it fails. Example: FMA detects a pattern of increasing PCIe correctable errors on a bus serving a Tier 1 NIC driver. The diagnosis engine fires, and instead of waiting for a hard failure, the response executor triggers §52's component replacement flow to swap the NIC driver to a degraded-mode variant (e.g., conservative I/O scheduler, reduced-bandwidth path). The replacement follows the same state serialization and quiescence protocol as any live evolution (§52.3), but is initiated automatically by FMA rather than by an administrator. This closes the gap between reactive fault handling and proactive system evolution — the kernel treats impending hardware failure as a trigger for self-repair rather than merely an alert.

39.7 Linux Interface Exposure

Entirely through standard mechanisms:

sysfs (per-device, under the device registry's sysfs tree):

/sys/devices/.../health/
    status          # "ok", "warning", "degraded", "critical"
    events_total    # Total health events received
    ce_count        # Correctable error count (memory)
    ue_count        # Uncorrectable error count (memory)
    life_remaining  # Percentage (storage)
    link_retrains   # Count (PCIe)
    temperature     # Current (thermal)

procfs:

/proc/isle/fma/
    rules           # Current diagnosis rules (read/write for admin)
    events          # Recent events across all devices (ring buffer dump)
    retired_pages   # List of retired physical pages
    statistics      # Aggregate counters

uevent: Standard hotplug mechanism for pushing notifications to userspace.

ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:1f.2
SUBSYSTEM=pci
ISLE_HEALTH=degraded
ISLE_HEALTH_REASON=pcie_link_unstable
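A userspace consumer splits this payload into key/value pairs. A sketch (the helper name is illustrative; real consumers receive these pairs over a netlink uevent socket or via udev):

```rust
// Sketch: parsing a KEY=VALUE uevent payload into a lookup map.
use std::collections::HashMap;

pub fn parse_uevent(payload: &str) -> HashMap<&str, &str> {
    payload
        .lines()
        .filter_map(|line| line.split_once('=')) // skip malformed lines
        .collect()
}
```

A systemd or udev rule could match on the `ISLE_HEALTH` key to trigger an admin notification when a device transitions to degraded.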

dmesg/printk: Standard kernel log for all alerts.

Existing tools that benefit without modification:

  • rasdaemon — uses kernel tracepoints (specifically the RAS tracepoints in /sys/kernel/debug/tracing/events/ras/) to collect hardware error events and stores them in a SQLite database; ISLE's sysfs health attributes provide an additional data source that rasdaemon can be extended to consume
  • Prometheus node_exporter — can scrape sysfs files
  • smartctl — storage health still works via the standard interface
  • systemd — can react to uevents


40. Stable Tracepoint ABI

Inspired by: DTrace's philosophy of always-on, zero-cost, production-grade tracing. IP status: Clean — design decisions applied to eBPF (Linux mechanism), not DTrace code. Tracepoints as stable interfaces is a policy choice, not a patentable invention.

40.1 Problem

Linux tracepoints are considered unstable internal API. They change name, arguments, and semantics between kernel versions. Tools like bpftrace and BCC scripts break regularly. The community's position: "tracepoints are for kernel developers, not users."

This is a policy choice, not a technical necessity. ISLE can make a different choice.

Section 20.4 specifies full eBPF support. This section defines a layer on top: a set of versioned, stable, documented tracepoints that applications can depend on across kernel updates.

40.2 Two Categories of Tracepoints

Category 1: Stable Tracepoints (isle_tp_stable_*) — Versioned, documented, covered by the same "never break without deprecation" policy as the userspace ABI. These are the tracepoints that monitoring tools, profilers, and production observability systems can depend on.

Category 2: Debug Tracepoints (isle_tp_debug_*) — Unstable, may change between any two releases. For kernel developer use only. Same policy as Linux tracepoints.

40.3 Stable Tracepoint Interface

// isle-core/src/trace/stable.rs (kernel-internal)

/// A stable tracepoint definition.
pub struct StableTracepoint {
    /// Tracepoint name (e.g., "isle_tp_stable_syscall_entry").
    /// Once published, this name never changes.
    pub name: &'static str,

    /// Category for organization.
    pub category: TracepointCategory,

    /// Version of this tracepoint's argument format.
    /// Starts at 1. Bumped when arguments are added (append-only).
    pub version: u32,

    /// Argument schema (for bpftool and documentation).
    pub args: &'static [TracepointArg],

    /// Probe function pointer (None when no eBPF program is attached).
    /// When None, the tracepoint has ZERO overhead (branch on static key).
    pub probe: AtomicPtr<()>,
}

#[repr(u32)]
pub enum TracepointCategory {
    Syscall     = 0,    // Syscall entry/exit
    Scheduler   = 1,    // Context switch, wakeup, migration
    Memory      = 2,    // Page fault, allocation, reclaim, compression
    Block       = 3,    // Block I/O submit, complete
    Network     = 4,    // Packet TX/RX, socket events
    Filesystem  = 5,    // VFS operations, page cache
    Driver      = 6,    // Driver load, unload, crash, recovery
    Power       = 7,    // PM transitions, frequency changes
    Security    = 8,    // Capability checks, access denials
    Fma         = 9,    // Health events, diagnosis actions
}

pub struct TracepointArg {
    pub name: &'static str,
    pub arg_type: TracepointArgType,
    pub description: &'static str,
}

#[repr(u32)]
pub enum TracepointArgType {
    U64         = 0,
    I64         = 1,
    U32         = 2,
    I32         = 3,
    Str         = 4,    // Pointer + length
    Bytes       = 5,    // Pointer + length
    Pid         = 6,    // Process ID (u32)
    Tid         = 7,    // Thread ID (u32)
    DeviceId    = 8,    // DeviceNodeId (u64)
    Timestamp   = 9,    // Nanoseconds since boot (u64)
}

40.4 Zero-Overhead When Disabled

Tracepoints use static keys (same mechanism as Linux). When no eBPF program is attached, the tracepoint site is a NOP instruction. There is zero overhead in the common case — no branch prediction cost, no cache pollution from tracepoint data collection. (Note: always-on aggregation counters in Section 40.7 add ~1-2 ns per event via per-CPU atomic increments, independent of tracepoint enablement.)

// At the tracepoint site in kernel code:

// This compiles to a NOP when no probe is attached.
// When a probe is attached, it becomes a call.
isle_trace_stable!(syscall_entry, {
    pid: current_pid(),
    tid: current_tid(),
    syscall_nr: nr,
    arg0: args[0],
    arg1: args[1],
    arg2: args[2],
});

When an eBPF program attaches to this tracepoint, the runtime patches the NOP to a CALL instruction (instruction patching, same as Linux static keys). When the program detaches, it's patched back to NOP.
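The attach/detach semantics can be modeled in Rust as an `AtomicPtr` that is null when no probe is attached. This is only a behavioral sketch: the real mechanism patches the instruction stream as described above, and the function names here are illustrative.

```rust
// Sketch of tracepoint attach/detach: a null probe pointer means the
// firing site does nothing (modeling the NOP case).
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};

pub static SYSCALL_ENTRY_PROBE: AtomicPtr<()> = AtomicPtr::new(std::ptr::null_mut());

/// Fire the tracepoint: a no-op unless a probe is attached.
pub fn trace_syscall_entry(nr: u64) {
    let p = SYSCALL_ENTRY_PROBE.load(Ordering::Acquire);
    if !p.is_null() {
        // Safety: only `fn(u64)` pointers are ever stored (see attach()).
        let f: fn(u64) = unsafe { std::mem::transmute(p) };
        f(nr);
    }
}

pub fn attach(probe: fn(u64)) {
    SYSCALL_ENTRY_PROBE.store(probe as usize as *mut (), Ordering::Release);
}

pub fn detach() {
    SYSCALL_ENTRY_PROBE.store(std::ptr::null_mut(), Ordering::Release);
}

// A toy probe that counts events, standing in for an attached eBPF program.
pub static HITS: AtomicU64 = AtomicU64::new(0);
pub fn counting_probe(_nr: u64) {
    HITS.fetch_add(1, Ordering::Relaxed);
}
```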

40.5 Stable Tracepoint Catalog

Initial set of stable tracepoints (version 1):

Syscall:

Tracepoint                    Arguments                        Description
isle_tp_stable_syscall_entry  pid, tid, nr, arg0-5, timestamp  Syscall entry
isle_tp_stable_syscall_exit   pid, tid, nr, ret, timestamp     Syscall return

Scheduler:

Tracepoint                    Arguments                                                Description
isle_tp_stable_sched_switch   prev_pid, prev_tid, next_pid, next_tid, prev_state, cpu  Context switch
isle_tp_stable_sched_wakeup   pid, tid, target_cpu, timestamp                          Task wakeup
isle_tp_stable_sched_migrate  pid, tid, orig_cpu, dest_cpu, timestamp                  Task migration

Memory:

Tracepoint                       Arguments                                  Description
isle_tp_stable_page_fault        pid, address, flags, timestamp             Page fault entry
isle_tp_stable_page_alloc        order, gfp_flags, numa_node, timestamp     Page allocation
isle_tp_stable_page_reclaim      nr_reclaimed, nr_scanned, priority         Reclaim cycle
isle_tp_stable_zpool_compress    original_size, compressed_size, algorithm  Page compressed
isle_tp_stable_zpool_decompress  compressed_size, latency_ns                Page decompressed

Block I/O:

Tracepoint                     Arguments                                       Description
isle_tp_stable_block_submit    device_id, sector, size, op, timestamp          I/O submitted
isle_tp_stable_block_complete  device_id, sector, size, op, latency_ns, error  I/O completed

Network:

Tracepoint                  Arguments                                                                 Description
isle_tp_stable_net_rx       device_id, len, protocol (ethertype, u16 network byte order), timestamp  Packet received
isle_tp_stable_net_tx       device_id, len, protocol (ethertype, u16 network byte order), timestamp  Packet transmitted
isle_tp_stable_tcp_connect  pid, saddr, sport, daddr, dport                                          TCP connection
isle_tp_stable_tcp_close    pid, saddr, sport, daddr, dport, duration_ns                             TCP close

Driver:

Tracepoint                     Arguments                                   Description
isle_tp_stable_driver_load     device_id, driver_name, tier, timestamp     Driver loaded
isle_tp_stable_driver_crash    device_id, driver_name, tier, fault_type    Driver crash
isle_tp_stable_driver_recover  device_id, driver_name, recovery_time_ns    Driver recovered
isle_tp_stable_tier_demote     device_id, driver_name, old_tier, new_tier  Tier demotion

FMA:

Tracepoint                 Arguments                         Description
isle_tp_stable_fma_event   device_id, class, severity, code  Health event
isle_tp_stable_fma_action  device_id, action, reason         Corrective action taken

40.6 Versioning Rules

Same philosophy as KABI:

  1. Tracepoint names are permanent. Once published, a stable tracepoint name never changes and never disappears (without multi-release deprecation).
  2. Arguments are append-only. New arguments can be added at the end. Existing arguments never change position, type, or semantics.
  3. Version field tracks argument schema. eBPF programs can check the tracepoint version to know which arguments are available.
  4. Deprecation requires 2+ major releases. A deprecated tracepoint continues to fire (possibly with stale/dummy data) for at least two major releases before removal.
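Rules 2 and 3 imply a simple compatibility model for consumers. A sketch, under the assumption that a consumer built against schema version N needs every argument of that version (both helper names and the `TracepointArgSpec` shape are illustrative):

```rust
// Sketch of append-only schema versioning for tracepoint arguments.
pub struct TracepointArgSpec {
    pub name: &'static str,
    pub since_version: u32, // schema version that appended this argument
}

/// A consumer built for schema version N can attach whenever the running
/// kernel publishes version >= N: newer versions only append arguments.
pub fn consumer_compatible(kernel_tp_version: u32, consumer_built_for: u32) -> bool {
    kernel_tp_version >= consumer_built_for
}

/// Arguments visible at a given runtime schema version. Because schemas
/// are append-only, the visible set is always a prefix of the arg list.
pub fn visible_args(args: &[TracepointArgSpec], runtime_version: u32) -> Vec<&'static str> {
    args.iter()
        .take_while(|a| a.since_version <= runtime_version)
        .map(|a| a.name)
        .collect()
}
```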

40.7 Built-In Aggregation Maps

For common observability patterns, provide pre-built eBPF maps that aggregate in-kernel:

/sys/kernel/isle/trace/
    syscall_latency_hist    # Per-syscall latency histogram (log2 buckets)
    block_latency_hist      # Per-device I/O latency histogram
    sched_latency_hist      # Scheduling latency histogram
    net_packet_count        # Per-interface packet counters
    page_fault_count        # Per-process page fault counters

These are always-on aggregation counters, separate from the tracepoint mechanism described in Section 40.4. The distinction is important:

  • Tracepoints (Section 40.4) have zero overhead when disabled. When no eBPF program is attached, the tracepoint site is a NOP instruction — no branch, no cost. They are designed for on-demand, deep inspection.
  • Aggregation counters are always active but are not tracepoint-based. They are simple atomic increments (e.g., AtomicU64::fetch_add(1, Relaxed)) embedded directly in the relevant code paths (syscall entry, block I/O completion, packet RX/TX, etc.). Their cost is a single atomic increment per event — typically 1-2 ns — which is negligible compared to the operation being measured. The histogram buckets are updated in-kernel with no eBPF program required.

Aggregation counters are per-CPU (PerCpu<AggregationCounters>), avoiding cross-core cache line bouncing. The per-CPU design ensures fetch_add operations are cache-local — no lock prefix needed on x86 when the counter is CPU-local.

Observability tools (Prometheus, Grafana agents, etc.) can scrape the aggregation counter files directly without enabling any tracepoints.
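The log2-histogram hot path can be sketched as follows (a flat array of atomics stands in for the per-CPU copies; bucket i covers latencies in [2^i, 2^(i+1)) nanoseconds, with 0 mapped to bucket 0):

```rust
// Sketch of the in-kernel log2 latency histogram: one relaxed atomic
// increment per event, no eBPF program required.
use std::sync::atomic::{AtomicU64, Ordering};

pub struct Log2Hist {
    // 64 buckets cover the full u64 nanosecond range.
    pub buckets: [AtomicU64; 64],
}

impl Log2Hist {
    pub fn new() -> Self {
        Log2Hist { buckets: std::array::from_fn(|_| AtomicU64::new(0)) }
    }

    /// log2 bucket index: floor(log2(v)), with 0 mapped to bucket 0.
    pub fn bucket_index(v: u64) -> usize {
        if v == 0 { 0 } else { 63 - v.leading_zeros() as usize }
    }

    /// The always-on hot path: a single relaxed atomic increment.
    pub fn record(&self, latency_ns: u64) {
        self.buckets[Self::bucket_index(latency_ns)].fetch_add(1, Ordering::Relaxed);
    }
}
```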

40.8 Linux Tool Compatibility

bpftrace: Works unmodified. Stable tracepoints appear as standard tracepoints:

bpftrace -e 'tracepoint:isle_stable:syscall_entry { @[args->nr] = count(); }'

perf: Works unmodified. Tracepoints visible via perf list:

perf list 'isle_stable:*'
perf record -e isle_stable:block_complete -a sleep 10

bpftool: Works unmodified. Can list and inspect programs attached to stable tracepoints.

40.9 Audit Subsystem

ISLE provides a structured audit subsystem for security-relevant events, built on the tracepoint infrastructure from Sections 40.1–40.8. Unlike Linux's auditd — a separate subsystem with its own filtering language, netlink protocol, and dispatcher daemon — ISLE's audit is integrated directly with the capability model and eBPF tracepoints. One mechanism serves both observability and security auditing, operating in one of two explicitly distinct delivery modes:

  • Best-effort mode (default): Audit record emission is always non-blocking. If the per-CPU ring buffer is full, the oldest unread record is overwritten. No thread ever blocks. System availability is prioritized over audit completeness.
  • Strict mode (audit_strict=true boot parameter): Audit record emission blocks with a configurable timeout (default 10 ms) when the ring buffer is full, applying backpressure to maintain gapless per-CPU sequences. This does not contradict the non-blocking hot path: in strict mode, blocking occurs only on buffer overflow (a rare backpressure event), not on every record emission.

These modes are detailed in Section 40.9.2 (delivery semantics) and Section 40.9.3 (overflow policy).

40.9.1 Security Audit Events

Audited events fall into six categories. All events in the Capability and Authentication categories are audited by default; other categories follow configurable policy.

Category           Events                               Example
Capability         grant, revoke, attenuate, delegate   Process A grants CAP_NET to process B
Authentication     login, logout, failed auth           SSH login attempt from 10.0.0.1
Access control     permission denied, capability check  Open /etc/shadow: CAP_READ denied
Process lifecycle  exec, exit, setuid-equiv             Process 1234 exec /usr/bin/sudo
Driver             load, crash, tier change             NVMe driver crashed, recovering
Configuration      sysctl change, policy load           Security policy reloaded

Each event fires a stable tracepoint in the Security category (TracepointCategory, section 40.3), so the same eBPF tooling works for security monitoring. Audit tracepoints are non-blocking on the hot path: they enqueue the audit record into a per-CPU lock-free ring buffer (the same zero-overhead static-key mechanism as all other tracepoints). The delivery guarantee is enforced asynchronously by the drain thread and the configured delivery mode (Section 40.9.2). In strict mode, the producer blocks only when the ring buffer is full (backpressure), not on every record emission. In best-effort mode (the default), the oldest unread record is overwritten and no thread ever blocks.

40.9.2 Audit Record Format

/// Audit event type discriminator.
///
/// ISLE audit event IDs use a base of 3000 (`ISLE_AUDIT_BASE`) to avoid collision
/// with all Linux audit message type ranges. Linux's `audit.h` allocates:
///   - 1000-2099: kernel messages (commands, events, anomalies, integrity, etc.)
///   - 2100-2999: userspace messages (`AUDIT_FIRST_USER_MSG2` .. `AUDIT_LAST_USER_MSG2`)
/// By starting at 3000, ISLE IDs are above all currently allocated Linux ranges.
/// The Linux compatibility layer (Section 40.9.5) translates ISLE audit events to
/// standard Linux audit message types (e.g., `AUDIT_AVC`, `AUDIT_USER_AUTH`) for
/// tools like auditd and ausearch.
///
/// Base constant: `ISLE_AUDIT_BASE = 3000`.
#[repr(u16)]
pub enum AuditEventType {
    /// Capability grant (cap_id, target_pid, permissions).
    CapGrant        = 3000,
    /// Capability denial (cap_id, requested_permissions, reason).
    CapDeny         = 3001,
    /// Capability revocation (cap_id, holder_pid).
    CapRevoke       = 3002,
    /// Process lifecycle (fork, exec, exit).
    ProcessLifecycle = 3010,
    /// Security policy change (module load/unload, policy update).
    PolicyChange    = 3020,
    /// Driver isolation event (tier change, crash, recovery).
    DriverIsolation = 3030,
    /// Authentication event (login, sudo, key use).
    Auth            = 3040,
    /// Administrative action (sysctl, mount, module load).
    AdminAction     = 3050,
    /// Audit drain thread started (meta-event, Section 40.9.3).
    /// Emitted by the audit subsystem's initialization code to make
    /// the drain thread's lifecycle observable even though its internal
    /// operations are not individually audited (to prevent self-deadlock).
    AuditDrainStart = 3060,
    /// Audit drain thread stopped (meta-event, Section 40.9.3).
    AuditDrainStop  = 3061,
    /// Batch HMAC seal marker (Section 40.9.4).
    HmacBatchSeal   = 3062,
}

/// A single audit record. Fixed-size header followed by variable-length detail.
/// Written atomically to the per-CPU audit ring buffer.
///
/// Field order is chosen for natural alignment with `#[repr(C)]`: all u64 fields
/// first, then u32 fields, then u16 fields, then u8 fields, with explicit padding
/// to avoid compiler-inserted holes. Total header size: 56 bytes (no hidden padding).
#[repr(C)]
pub struct AuditRecord {
    // --- 8-byte aligned fields (offset 0) ---

    /// Monotonic nanosecond timestamp (same clock as tracepoints).
    pub timestamp_ns: u64,
    /// Per-CPU monotonically increasing sequence number. Each CPU maintains
    /// its own independent sequence counter. In strict mode, sequences are
    /// gapless (producers block on overflow); in best-effort mode, gaps
    /// indicate lost records and are detected via sequence discontinuities.
    /// On drain, the merge-sort thread interleaves per-CPU chains and
    /// produces a combined audit log with (cpu_id, per_cpu_sequence)
    /// tuples for total ordering reconstruction.
    pub sequence: u64,
    /// Capability handle used (or attempted) for this operation.
    /// `CapHandle` is a 64-bit opaque handle indexing into the process's
    /// capability table (Section 11.1, isle-core capability model). User space
    /// holds handles, never raw capability data. Defined as:
    /// `pub struct CapHandle(u64);`
    pub subject_cap: CapHandle,
    /// Opaque identifier for the target resource (inode, device, endpoint).
    pub object_id: u64,

    // --- 4-byte aligned fields (offset 32) ---

    /// Process ID of the subject.
    pub pid: u32,
    /// Thread ID of the subject.
    pub tid: u32,
    /// User ID of the subject (mapped from capability-based identity).
    pub uid: u32,

    // --- 2-byte aligned fields (offset 44) ---

    /// CPU that emitted this record. Together with `sequence`, provides a
    /// globally unique, tamper-evident record identifier.
    pub cpu_id: u16,
    /// The audit event type (which category + specific event).
    pub event_type: AuditEventType,
    /// Length of the variable-length detail payload that follows.
    pub detail_len: u16,

    // --- 1-byte aligned fields (offset 50) ---

    /// Outcome of the audited operation.
    pub result: AuditResult,

    /// Explicit padding to 8-byte alignment boundary (56 bytes total).
    /// Prevents compiler-inserted hidden padding in `#[repr(C)]`.
    pub _pad: [u8; 5],

    // Variable-length detail follows: structured key=value pairs
    // encoded as length-prefixed UTF-8 strings.
}
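
The stated layout can be checked mechanically. A userspace sketch — assuming `AuditEventType` is `#[repr(u16)]`, which its placement among the 2-byte fields implies — that asserts the 56-byte header size and the offsets noted in the field-group comments:

```rust
// Userspace mirror of the AuditRecord header, for layout verification only.
// Assumption: AuditEventType is #[repr(u16)] (implied by its placement among
// the 2-byte fields); one variant each suffices for the size check.

#[repr(C)]
pub struct CapHandle(pub u64);

#[repr(u16)]
pub enum AuditEventType {
    CapDeny = 3001,
}

#[repr(u8)]
pub enum AuditResult {
    Success = 0,
}

#[repr(C)]
pub struct AuditRecord {
    pub timestamp_ns: u64,
    pub sequence: u64,
    pub subject_cap: CapHandle,
    pub object_id: u64,
    pub pid: u32,
    pub tid: u32,
    pub uid: u32,
    pub cpu_id: u16,
    pub event_type: AuditEventType,
    pub detail_len: u16,
    pub result: AuditResult,
    pub _pad: [u8; 5],
}

fn main() {
    // 56-byte header, field groups at the offsets stated in the comments.
    assert_eq!(std::mem::size_of::<AuditRecord>(), 56);
    assert_eq!(std::mem::offset_of!(AuditRecord, pid), 32);
    assert_eq!(std::mem::offset_of!(AuditRecord, cpu_id), 44);
    assert_eq!(std::mem::offset_of!(AuditRecord, result), 50);
}
```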

#[repr(u8)]
pub enum AuditResult {
    /// Operation succeeded.
    Success = 0,
    /// Operation denied by capability check.
    Denied  = 1,
    /// Operation failed for a non-security reason.
    Error   = 2,
}

Records are written to a per-CPU lock-free ring buffer, then drained to persistent storage by a dedicated kernel audit thread. Each CPU maintains its own monotonic sequence counter without requiring a global atomic counter (which would defeat per-CPU parallelism). The drain thread merges and sorts per-CPU chains by timestamp before writing to the audit log, producing a combined record stream with (cpu_id, per_cpu_sequence) tuples for total ordering.

Timestamp ordering: The merge-sort uses (timestamp_ns, cpu_id, per_cpu_sequence) as the composite sort key — timestamp is primary, with CPU ID and per-CPU sequence as tiebreakers. TSC synchronization (Section 16a) bounds cross-CPU skew to <100ns on modern hardware. For systems with larger skew (NUMA, pre-invariant-TSC), the drain thread applies a per-CPU offset correction table calibrated at boot.
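
The drain-side merge is a standard k-way heap merge over the composite key. A sketch with hypothetical names, modeling each per-CPU stream as an already-ordered vector of (timestamp_ns, cpu_id, per_cpu_sequence) tuples:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Interleave per-CPU record streams by the composite sort key
/// (timestamp_ns, cpu_id, per_cpu_sequence). Payloads are omitted;
/// records here are just the key tuple.
pub fn merge_per_cpu(streams: &[Vec<(u64, u16, u64)>]) -> Vec<(u64, u16, u64)> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of each per-CPU stream.
    for (i, s) in streams.iter().enumerate() {
        if let Some(&head) = s.first() {
            heap.push(Reverse((head, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((rec, stream, idx))) = heap.pop() {
        out.push(rec);
        if let Some(&next) = streams[stream].get(idx + 1) {
            heap.push(Reverse((next, stream, idx + 1)));
        }
    }
    out
}

fn main() {
    // CPU 0 and CPU 1 streams, each already time-ordered.
    let cpu0 = vec![(100, 0, 0), (300, 0, 1)];
    let cpu1 = vec![(100, 1, 0), (200, 1, 1)];
    let merged = merge_per_cpu(&[cpu0, cpu1]);
    // Timestamp is primary; cpu_id breaks the tie at t=100.
    assert_eq!(merged, vec![(100, 0, 0), (100, 1, 0), (200, 1, 1), (300, 0, 1)]);
}
```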

Variable-length record framing with overwrite safety. Each AuditRecord has a fixed-size 56-byte header followed by a variable-length detail payload (length given by detail_len). Since records vary in total size, the ring buffer uses a composed framing and overwrite-safety protocol. Each ring slot has the following byte layout:

+-------------+-------------+-------------------------------------------+------------+
| start_seq   | length      | AuditRecord header + detail payload       | end_seq    |
| (u32, 4B)   | (u32, 4B)   | (56 + detail_len bytes)                   | (u32, 4B)  |
+-------------+-------------+-------------------------------------------+------------+
 ^                           ^                                           ^
 Overwrite safety            Framing (record boundary)                   Overwrite safety

Framing protocol (record boundary detection):

  • The length field (second u32) stores the total size of the framed entry: 4 (start_seq) + 4 (length) + 56 (header) + detail_len (payload) + 4 (end_seq).
  • When a record would wrap around the end of the ring buffer (remaining space < total framed size), a skip marker is written: a slot where length == 0. This signals the consumer to skip the remaining bytes and continue from offset 0. The record is then written starting at offset 0.
  • Consumers read the length field at the current position. If length == 0 (skip marker), they advance to the buffer start. Otherwise, they read length bytes as a complete framed record.
  • The skip-marker value 0 cannot conflict with start_seq because start_seq is the first u32 in the slot while length is the second. Consumers distinguish them by position within the slot layout.
  • This skip-marker approach wastes at most max_record_size - 1 bytes per wrap-around but avoids the complexity of split-record handling in the lock-free read path. The wasted space is bounded because detail_len is capped at 4096 bytes (audit records with payloads exceeding this limit are truncated and tagged with a DETAIL_TRUNCATED flag).
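
The framing layer alone can be sketched in userspace. In this illustrative version the payload stands in for the 56-byte header plus detail, the sequence stamps are written as already-complete values (the overwrite-safety handshake is omitted), and all names are hypothetical:

```rust
const FRAME_OVERHEAD: usize = 12; // start_seq (4) + length (4) + end_seq (4)

/// Append one framed entry; on insufficient space before the end, write a
/// skip marker (length field == 0) and wrap to offset 0. Assumes at least
/// 8 bytes remain at the end; a real ring reserves that headroom.
pub fn append(buf: &mut [u8], pos: &mut usize, seq: u32, payload: &[u8]) {
    let total = FRAME_OVERHEAD + payload.len();
    if *pos + total > buf.len() {
        buf[*pos + 4..*pos + 8].copy_from_slice(&0u32.to_le_bytes()); // skip marker
        *pos = 0; // wrapped write may overwrite the oldest data (drop-oldest)
    }
    let p = *pos;
    buf[p..p + 4].copy_from_slice(&seq.to_le_bytes());                // start_seq (final value)
    buf[p + 4..p + 8].copy_from_slice(&(total as u32).to_le_bytes()); // length
    buf[p + 8..p + 8 + payload.len()].copy_from_slice(payload);       // header + detail stand-in
    buf[p + total - 4..p + total].copy_from_slice(&seq.to_le_bytes());// end_seq
    *pos += total;
}

/// Read the entry at `pos`, honoring skip markers.
pub fn read(buf: &[u8], pos: &mut usize) -> Vec<u8> {
    let length_at = |p: usize| u32::from_le_bytes(buf[p + 4..p + 8].try_into().unwrap()) as usize;
    if *pos + 8 > buf.len() || length_at(*pos) == 0 {
        *pos = 0; // skip marker (or end of buffer): continue from offset 0
    }
    let total = length_at(*pos);
    let payload = buf[*pos + 8..*pos + total - 4].to_vec();
    *pos += total;
    payload
}

fn main() {
    let mut buf = vec![0u8; 40];
    let mut w = 0;
    append(&mut buf, &mut w, 2, b"rec-a");        // 17 bytes at offset 0
    append(&mut buf, &mut w, 4, b"record-bbbbb"); // 24 bytes: 17 + 24 > 40, so skip-mark and wrap
    let mut r = 17; // consumer positioned where the wrap occurred
    assert_eq!(read(&buf, &mut r), b"record-bbbbb".to_vec());
    assert_eq!(r, 24);
}
```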

Overwrite safety protocol (torn-read prevention):

  • Each entry is bracketed by start_seq (first u32) and end_seq (last u32).
  • The producer writes: start_seq = write_seq | 1 (odd = write in progress), then the length + entry data, then end_seq = write_seq (even = complete), then rewrites start_seq = write_seq to publish completion, and finally increments write_seq by 2. The final start_seq store is required: without it, start_seq would remain odd and consumers would discard every record.
  • The consumer reads start_seq, copies the entry, reads end_seq. If start_seq != end_seq or start_seq is odd, the entry was being overwritten during the read — the consumer discards it and advances to the next entry.
  • This ensures consumers never observe torn data. The two protocols compose cleanly: framing uses the length field to find record boundaries, while the sequence stamps at the slot's edges detect concurrent overwrites.
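
The stamp validation can be sketched single-threaded (the real path needs atomics and memory barriers; this shows only the bracketing logic, including the final store that returns start_seq to its even value). Names are illustrative:

```rust
/// Seqlock-style overwrite safety on a single slot, simulated single-threaded.
struct Slot {
    start_seq: u32,
    data: [u8; 8],
    end_seq: u32,
}

impl Slot {
    /// Producer: mark in progress (odd), write, then publish completion (even).
    fn write(&mut self, seq: u32, data: [u8; 8]) {
        self.start_seq = seq | 1;  // odd: write in progress
        self.data = data;
        self.end_seq = seq & !1;   // even: complete
        self.start_seq = seq & !1; // publish completion to readers
    }

    /// Consumer: copy first, then validate the bracketing stamps.
    fn read(&self) -> Option<[u8; 8]> {
        let s = self.start_seq;
        let copy = self.data;
        let e = self.end_seq;
        if s % 2 == 1 || s != e {
            return None; // torn: an overwrite was in flight during the copy
        }
        Some(copy)
    }
}

fn main() {
    let mut slot = Slot { start_seq: 0, data: [0; 8], end_seq: 0 };
    slot.write(2, *b"record-1");
    assert_eq!(slot.read(), Some(*b"record-1"));
    // Simulate a reader observing a half-finished overwrite.
    slot.start_seq = 4 | 1;
    assert_eq!(slot.read(), None);
}
```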

Sequence numbers are monotonically increasing within each CPU's stream. The gap-detection behavior depends on the configured delivery mode:

  • Strict mode (audit_strict=true): Sequence numbers are gapless. The producing thread blocks (with configurable timeout, default 10 ms) when the ring buffer is full, applying backpressure to maintain the gapless invariant. Any gap in a single CPU's sequence under strict mode indicates a bug or tampering — the audit subsystem treats this as a critical security event, emitting a synthetic AUDIT_LOST record and raising an alert via the FMA subsystem (Section 39).
  • Best-effort mode (default): Gaps are possible when the ring buffer overflows (oldest records are overwritten). Consumers detect lost records by observing discontinuities in the per-CPU sequence numbers. The records_dropped counter (per-CPU AtomicU64) tracks how many records were lost, and the drain thread includes the drop count in periodic "audit health" meta-records so that log consumers can quantify the gap.
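
In best-effort mode, gap accounting reduces to summing the discontinuities in one CPU's sequence stream. A sketch with a hypothetical helper name:

```rust
/// Count records lost on one CPU's stream: the sum of gaps between
/// consecutive observed per-CPU sequence numbers.
pub fn lost_records(seqs: &[u64]) -> u64 {
    seqs.windows(2).map(|w| w[1] - w[0] - 1).sum()
}

fn main() {
    // Stream with two gaps: sequences 3..=5 missing (3 lost), 8 missing (1 lost).
    assert_eq!(lost_records(&[1, 2, 6, 7, 9]), 4);
    // A gapless stream (strict mode invariant) reports zero losses.
    assert_eq!(lost_records(&[0, 1, 2, 3]), 0);
}
```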

40.9.3 Audit Policy Engine

The audit policy engine determines which events are recorded. Default policy:

  • Always audit: all capability denials, all authentication events, all exec calls.
  • Configurable: capability grants, driver lifecycle, configuration changes.

Policy is expressed as eBPF programs attached to audit tracepoints — the same mechanism used for security monitoring (Section 40.4). The delivery guarantee depends on the configured mode: best-effort (default) or strict (guaranteed delivery).

Overflow policy. The two delivery modes described above (Section 40.9.2) govern overflow behavior:

  • Best-effort mode (default, audit_strict=false): Drop-oldest policy. When the per-CPU ring buffer is full, the oldest unread record is overwritten. The kernel prioritizes system availability over audit completeness.
  • Strict mode (audit_strict=true boot parameter): The emitting thread blocks with a configurable timeout (default 10 ms) rather than overwriting records, maintaining gapless per-CPU sequences. To prevent self-deadlock, the drain thread is exempt from auditing its own operations via the non-audited I/O path:
  • The drain thread writes directly to a dedicated audit partition (configured at boot via audit_log_dev= kernel parameter) using raw block I/O (submit_bio directly to the block device) rather than the VFS write() syscall path that would trigger audit events. The audit partition must be a separate block device or partition — not a file on a journaled filesystem — to avoid bypassing filesystem journaling (raw submit_bio to data blocks on a journaled filesystem like ext4/XFS would corrupt the journal's consistency guarantees). The audit subsystem manages its own simple log-structured layout on this partition (sequential append with a header containing magic, version, and write offset).
  • On systems where a dedicated partition is unavailable, the fallback audit_log_path= parameter specifies a regular file, but in this mode the drain thread uses the VFS write path with the PF_NOAUDIT flag (no audit recursion) instead of raw submit_bio. This is slower but preserves filesystem integrity.
  • The block device's I/O completion path is also marked non-auditable via a per-thread flag (PF_NOAUDIT) set on the drain thread at creation.
  • This design ensures the drain thread can always make forward progress even when the audit ring buffers are full, breaking the circular dependency. To mitigate the resulting audit blind spot, the drain thread's own activation and deactivation are logged as meta-events (type AuditDrainStart and AuditDrainStop, Section 40.9.1) by the audit subsystem's initialization code, not by the drain thread itself. This ensures the drain thread's lifecycle is observable even though its internal operations are not individually audited. RT tasks (SCHED_FIFO / SCHED_RR) always use the drop-oldest policy regardless of strict mode, to preserve real-time scheduling guarantees.

Timeout behavior in strict mode: After the 10 ms timeout expires and the ring buffer is still full, the current record IS emitted by force-evicting the oldest unread entry (converting temporarily to best-effort for that single entry). This preserves the guarantee that the current event is never lost — only the oldest, already-in-buffer event can be evicted (which the drain thread is expected to have forwarded already). The eviction is tracked via two mechanisms: a per-CPU forced_eviction_count: AtomicU64 counter (incremented on each forced eviction), and a per-CPU missed_sequences sideband buffer that records the sequence number of each evicted entry. The drain thread reads the missed_sequences buffer and emits a synthetic AUDIT_LOST record for each evicted sequence, ensuring the gap is visible to verifiers even though the evicted record's content is irretrievably lost. This approach avoids both deadlock (the producing thread never blocks indefinitely) and silent loss of the current security event (the event being audited right now is the one most likely to be attacker-relevant).
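
The forced-eviction path can be sketched on a bounded queue standing in for the per-CPU ring; field and method names are illustrative:

```rust
use std::collections::VecDeque;

/// Strict-mode forced eviction after timeout: evict the oldest entry,
/// record its sequence number in the missed_sequences sideband, and
/// bump the forced-eviction counter. The current event is never lost.
struct StrictRing {
    ring: VecDeque<u64>, // sequence numbers standing in for full records
    capacity: usize,
    missed: Vec<u64>,           // missed_sequences sideband buffer
    forced_eviction_count: u64, // per-CPU AtomicU64 in the real design
}

impl StrictRing {
    fn emit_after_timeout(&mut self, seq: u64) {
        if self.ring.len() == self.capacity {
            let evicted = self.ring.pop_front().unwrap();
            self.missed.push(evicted); // drain thread emits AUDIT_LOST for this
            self.forced_eviction_count += 1;
        }
        self.ring.push_back(seq);
    }
}

fn main() {
    let mut r = StrictRing {
        ring: VecDeque::new(),
        capacity: 2,
        missed: Vec::new(),
        forced_eviction_count: 0,
    };
    for seq in [0, 1, 2] {
        r.emit_after_timeout(seq);
    }
    assert_eq!(r.ring.iter().copied().collect::<Vec<_>>(), vec![1, 2]);
    assert_eq!(r.missed, vec![0u64]); // gap stays visible to verifiers
    assert_eq!(r.forced_eviction_count, 1);
}
```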

/// Attach an audit policy program. Unlike debug tracepoints, audit programs
/// participate in the guaranteed-delivery path.
pub fn attach_audit_policy(
    tracepoint: &'static StableTracepoint,
    prog: &VerifiedBpfProgram,
) -> Result<AuditPolicyHandle, AuditError> {
    // Verified eBPF program acts as filter: returns true to audit, false to skip.
    // The program can inspect all tracepoint arguments to make the decision.
    // ...
}

Rate limiting. To prevent audit log flooding from misbehaving or malicious processes, each event type has a configurable rate limit (default: 10 000 events/second per type). When the rate limit is hit, the audit subsystem coalesces events into a single summary record (e.g., "5 327 additional CAP_READ denials from pid 4001 suppressed in last 1s") rather than silently dropping them. The summary record consumes a sequence number in the per-CPU stream, preserving the sequence continuity invariant (in strict mode) while bounding log volume.
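
The coalescing behavior can be sketched as a per-window counter. The real subsystem tracks this per event type; the sketch below covers a single type with fixed 1 s windows, and all names are hypothetical:

```rust
/// Per-type rate limiter: past the limit, events are counted and later
/// flushed as one coalesced summary record (which consumes its own
/// sequence number, preserving sequence continuity).
struct RateLimiter {
    limit_per_sec: u64,
    emitted: u64,
    suppressed: u64,
}

impl RateLimiter {
    /// Returns true if the event should be recorded individually.
    fn on_event(&mut self) -> bool {
        if self.emitted < self.limit_per_sec {
            self.emitted += 1;
            true
        } else {
            self.suppressed += 1;
            false
        }
    }

    /// Called at each 1 s window boundary; yields the coalesced summary count.
    fn roll_window(&mut self) -> Option<u64> {
        let s = std::mem::take(&mut self.suppressed);
        self.emitted = 0;
        (s > 0).then_some(s)
    }
}

fn main() {
    let mut rl = RateLimiter { limit_per_sec: 3, emitted: 0, suppressed: 0 };
    let recorded = (0..10).filter(|_| rl.on_event()).count();
    assert_eq!(recorded, 3);               // first 3 pass through
    assert_eq!(rl.roll_window(), Some(7)); // "7 additional ... suppressed in last 1s"
    assert_eq!(rl.roll_window(), None);    // quiet window: no summary record
}
```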

40.9.4 Tamper-Evident Log Chain

Each CPU maintains an independent HMAC chain over its audit records, making retroactive tampering detectable without cross-CPU synchronization:

Initial:   key[0] = HMAC-SHA256(boot_secret, cpu_id || "audit-chain-v1")
Evolution: key[n] = HMAC-SHA256(key[n-1], "evolve")
After computing key[n], key[n-1] is securely erased (zeroed).
Per-record: hmac[n] = HMAC-SHA256(key[n], hmac[n-1] || serialize(record[n]))
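
The chain structure (not the cryptography) can be sketched in userspace. The keyed hash below is a toy stand-in built on std's SipHash-based DefaultHasher — NOT HMAC-SHA256 — used only to show key evolution, record chaining, and replay verification:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy keyed hash standing in for HMAC-SHA256. Illustrative only:
/// it shows the chaining structure, not cryptographic strength.
fn mac(key: u64, msg: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    msg.hash(&mut h);
    h.finish()
}

/// Strict-mode chain: per-record MAC over (prev_mac || record), with the
/// key evolving after every record (key[n-1] would be zeroed here).
fn chain(key0: u64, records: &[&[u8]]) -> Vec<u64> {
    let mut key = key0;
    let mut prev = 0u64;
    let mut out = Vec::new();
    for &rec in records {
        prev = mac(key, &[&prev.to_le_bytes()[..], rec].concat());
        out.push(prev);
        key = mac(key, b"evolve");
    }
    out
}

fn main() {
    // key[0] derivation: boot secret + per-CPU label.
    let key0 = mac(0xB007_5EC2, b"cpu0-audit-chain-v1");
    let records: Vec<&[u8]> = vec![b"rec0", b"rec1", b"rec2"];
    let hmacs = chain(key0, &records);

    // A verifier holding the root replays the chain and matches every MAC.
    assert_eq!(chain(key0, &records), hmacs);

    // Tampering with record 1 breaks verification there and at all successors.
    let tampered: Vec<&[u8]> = vec![b"rec0", b"recX", b"rec2"];
    let replay = chain(key0, &tampered);
    assert_eq!(replay[0], hmacs[0]);
    assert_ne!(replay[1], hmacs[1]);
    assert_ne!(replay[2], hmacs[2]);
}
```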

Important: The HMAC is NOT stored in the per-CPU ring buffer. The AuditRecord struct in the ring buffer contains only the event data (timestamp, event type, subject, object, detail). The HMAC is computed by the drain thread when it reads records from the per-CPU ring buffer and writes them to the persistent audit log. In the persistent format, each record (strict mode) or each batch (batched mode) is stored alongside its HMAC. This separation keeps the hot-path ring buffer append fast (no cryptographic operations on the recording CPU) while ensuring tamper evidence in the durable log.

Performance cost. HMAC-SHA256 on a typical 56-byte audit record header plus variable-length detail (average ~200-300 bytes total) takes approximately 200-500 ns per record on modern hardware. At sustained high audit rates (100K records/sec in an audit-heavy workload), this costs 20-50 ms/sec of CPU time, roughly 2-5% of one core. For most workloads (1K-10K records/sec) the cost is negligible. To mitigate the cost for non-security-critical events, HMAC computation is configurable per event category: security-critical categories (CapGrant, CapDeny, CapRevoke, Auth) always use strict mode (per-record HMAC with per-record key evolution); other categories can be configured to use batched mode or no HMAC at all. The two HMAC modes operate as follows:

  • Strict HMAC mode (per-record): Each record receives its own HMAC. The HMAC key evolves after every record: key[n] = HMAC-SHA256(key[n-1], "evolve"), and key[n-1] is erased. This provides per-record tamper evidence and forward secrecy.
  • Batched HMAC mode (per-batch): A single HMAC covers N consecutive records (default N=16). All records in a batch are serialized and HMACed together under a single key. The key evolves per-batch (not per-record): after computing the HMAC for batch B, the key evolves once to produce the key for batch B+1, and the previous key is erased. This amortizes the cryptographic cost across N records while still providing tamper evidence at batch granularity and forward secrecy at batch boundaries. Within a batch, individual record tampering is still detectable because the batch HMAC covers all records in sequence.

Batch framing: In batched mode, the drain thread accumulates N records and then writes a batch seal — a special entry with AuditEventType::HmacBatchSeal containing the batch HMAC, the batch sequence range (first and last sequence numbers covered), and the evolved key's public commitment (SHA-256 of the next key, enabling verifiers to detect key-evolution breaks). Individual records within a batch carry no HMAC; tamper evidence is provided solely by the batch seal. The verifier reads records from the persistent log until it encounters an HmacBatchSeal, verifies the batch HMAC over all preceding unsealed records (by re-serializing and re-HMACing them), then continues to the next batch. If the log ends mid-batch (e.g., crash before the drain thread wrote the seal), the trailing unsealed records are flagged as unverifiable in the verification report.

The per-category HMAC policy is set via /proc/isle/audit/hmac_policy.

The boot_secret is derived from: (a) TPM-sealed entropy if a TPM is present (preferred), or (b) hardware RNG (RDRAND/RNDR/platform-specific entropy source) collected during early boot if no TPM. There is no static fallback key embedded in the kernel image, because a key readable from the kernel binary would defeat tamper evidence (any attacker with access to the image could forge the HMAC chain). On systems with neither a TPM nor a hardware RNG, the audit subsystem generates a random seed from whatever entropy is available at boot (interrupt timing jitter, memory contents) and stores it only in kernel memory. This seed is lost on reboot, which means the HMAC chain cannot be verified across reboots without a TPM. However, within a single boot the chain provides full tamper evidence, and the inability to verify across reboots is an acceptable tradeoff: it prevents an attacker who obtains the kernel image from forging audit records. For offline log verification, the SHA-256 hash of the boot_secret (not the secret itself) is recorded in the first audit record of each boot, allowing a verifier with the original secret to replay the HMAC chain.

Each CPU's HMAC chain starts from a known initial value at boot, seeded with the CPU ID to ensure chains are distinguishable. The key material is stored in isle-core kernel memory, protected by the kernel's core isolation domain (inaccessible to drivers and all userspace). On x86, this corresponds to PKEY 0; on other architectures, equivalent protection is provided by the platform's isolation mechanism (Section 3). Even a compromised Tier 2 driver cannot read or forge the audit HMAC.

Forward secrecy. In strict HMAC mode, the key evolves after every record: key[n] is derived from key[n-1], and key[n-1] is securely erased (zeroed) immediately after computing key[n]. In batched HMAC mode, the key evolves after every batch (not every record), so forward secrecy granularity is per-batch rather than per-record. In both modes, forward secrecy applies to live key material in kernel memory: compromising the current in-memory HMAC key does not allow forging past records (or past batches), because previous keys have been erased from memory. An attacker who captures the current key can forge records from that point onward but cannot reconstruct earlier keys.

Forward secrecy does not conflict with post-crash verification, because verification is performed by an offline verifier that holds the original boot_secret, not by the running kernel. The verifier derives the full key sequence deterministically from boot_secret (the same KDF used during recording: key[0] = HMAC-SHA256(boot_secret, cpu_id || "audit-chain-v1"), key[n] = HMAC-SHA256(key[n-1], "evolve")). Erased in-memory keys are not recoverable from the running kernel; boot_secret is the durable verification root. On TPM-equipped systems with HMAC key checkpointing enabled (see below), derived chain keys may be written to persistent storage in TPM-encrypted form — forward secrecy in that configuration is bounded by the checkpoint interval (at most 1,000 derivations). On systems without checkpointing, no derived keys are ever persisted. On TPM-equipped systems the boot_secret is TPM-sealed (see "TPM-sealed audit key" below) and is the only secret that must be protected for offline verification. On non-TPM systems, the SHA-256 hash of the boot_secret is recorded in the first audit record so verifiers can confirm they hold the correct secret before replaying the chain.

HMAC key checkpointing: To prevent O(N) re-derivation after a crash, the current HMAC chain key is checkpointed to persistent storage every 1,000 derivations (or every 60 seconds, whichever comes first). The checkpoint is encrypted with a TPM-sealed key (Section 22) and stored alongside the audit log. On crash recovery, re-derivation starts from the most recent checkpoint rather than the root — at most 1,000 HMAC derivations (~50 μs at ~50 ns per HMAC) rather than potentially millions. The checkpoint interval is configurable via audit.hmac_checkpoint_interval. The checkpoint itself is a single 64-byte write (32-byte key + 32-byte HMAC of the key for integrity) — negligible I/O overhead.

Crash recovery. If the system crashes mid-chain (e.g., between writing a record's data and computing its HMAC), the HMAC chain must be resumable. On boot, the audit subsystem performs the following recovery procedure for each CPU's persisted chain:

  1. Read the last complete record (one with a valid HMAC field) from the persisted log.
  2. Re-derive key[record_seq] from the boot_secret (which is either TPM-unsealed or re-entered by the administrator for offline verification), and verify that record's HMAC. The boot_secret is the durable root; in-memory keys that were erased during normal operation can always be re-derived from it.
  3. If verification succeeds, the chain is intact up to that record. Any data following the last valid HMAC is treated as an incomplete record: it is preserved in the log but tagged with a special AUDIT_CRASH_TRUNCATED sentinel (hmac field set to all 0xFF bytes) and a flag indicating the record is unverified.
  4. The new boot starts a fresh HMAC chain (new boot_secret, new key[0] per CPU). The first record of the new chain includes a back-reference to the last verified record of the previous boot (boot_id, cpu_id, sequence number) so that verifiers can link chains across reboots.

Records with the AUDIT_CRASH_TRUNCATED sentinel are excluded from HMAC chain verification but remain in the log for forensic analysis. Log consumers and the isle-audit-verify tool recognize the sentinel and report "N crash-truncated records found" rather than treating them as tampering.

Verification. Any consumer of the audit log can verify each per-CPU chain independently by re-deriving the key sequence from boot_secret and replaying the HMAC computation over that CPU's stored records. A mismatch at position N in CPU C's chain proves that record N (or a predecessor on that CPU) was modified or deleted after writing. The per-CPU design means a single CPU's chain can be verified without needing records from other CPUs. The isle-audit-verify tool accepts the boot_secret (or, on TPM systems, unseals it automatically) and performs the full chain replay offline.

TPM-sealed audit key (optional). When a TPM 2.0 is available, the HMAC key is sealed to the TPM's PCR state and only unsealed when the boot chain is verified (Section 22). If an attacker modifies the kernel or early boot components, the audit key becomes unavailable — the system cannot produce valid audit HMACs, making the compromise visible to remote attestation.

40.9.5 Linux Compatibility

ISLE exports audit records in formats that standard Linux audit tools understand, so existing security infrastructure works without modification.

auditctl / ausearch / aureport. ISLE translates its audit records to Linux audit format (type=SYSCALL msg=audit(...), type=AVC, type=USER_AUTH, etc.) and delivers them to userspace via the NETLINK_AUDIT socket (see below). The kernel never writes to /var/log/audit/audit.log directly — that is the responsibility of the userspace audit daemon (auditd or go-audit), which receives records from the netlink socket and handles persistence. This avoids circular dependencies (audit subsystem depending on VFS, block I/O, etc.) and maintains kernel/userspace separation. The auditctl command sets filter rules, which are internally compiled to eBPF audit policy programs.

audit netlink socket. ISLE implements the NETLINK_AUDIT protocol so that auditd or go-audit can receive events in real time. The translation layer maps ISLE capability events to the closest Linux audit message types (e.g., capability denial maps to type=AVC).

journald integration. Audit records are also forwarded to the systemd journal with structured fields (_AUDIT_TYPE=, _AUDIT_ID=, OBJECT_PID=, etc.), making them queryable via journalctl:

journalctl _AUDIT_TYPE=AVC --since "5 minutes ago"
journalctl _AUDIT_TYPE=USER_AUTH _HOSTNAME=prod-web-01

syslog forwarding. For centralized log collection (Splunk, Elasticsearch, Graylog), audit records can be forwarded over syslog (RFC 5424) with configurable facility and severity mapping via /etc/isle/audit.conf.


40a. Debugging and Process Inspection

Inspired by: Solaris/illumos mdb, Linux ptrace, seL4 capability-gated debugging. IP status: Clean — ptrace is a standard POSIX interface; capability-gating access control is a design policy, not a patentable mechanism.

40a.1 Capability-Gated ptrace

Linux problem: ptrace is a powerful but coarse-grained tool. A single PTRACE_ATTACH call gives the debugger complete control over the target process: read and write arbitrary memory, read and write registers, inject signals, single-step instructions, intercept syscalls. Access control is limited to UID checks and the optional Yama LSM (/proc/sys/kernel/yama/ptrace_scope). In container environments, ptrace across namespace boundaries is blocked by user namespace checks, but within a namespace any process with the same UID (or CAP_SYS_PTRACE) can attach to any other. There is no way to grant partial debug access — it is all or nothing.

ISLE design: Each ptrace operation requires a specific combination of capabilities. The debugger must hold explicit capability tokens scoped to the target process and the operations it intends to perform.

Operation                            Required Capability
----------------------------------   -------------------------------------------------
PTRACE_ATTACH / PTRACE_SEIZE         CAP_DEBUG on target process
PTRACE_PEEKDATA / PTRACE_POKEDATA    CAP_DEBUG + READ or WRITE on target address space
PTRACE_GETREGS / PTRACE_SETREGS      CAP_DEBUG + READ or WRITE on register state
PTRACE_SINGLESTEP                    CAP_DEBUG + EXECUTE control
PTRACE_SYSCALL                       CAP_DEBUG + SYSCALL_TRACE

// isle-core/src/debug/ptrace.rs

/// Validate that the calling process holds sufficient capabilities
/// for the requested ptrace operation on the target.
fn check_ptrace_cap(
    caller: &Process,
    target: &Process,
    request: PtraceRequest,
) -> Result<(), CapError> {
    // Caller must hold CAP_DEBUG scoped to the target's object ID.
    let debug_cap = caller.cap_table.lookup(
        target.object_id(),
        PermissionBits::DEBUG,
    )?;

    // Additional permission checks based on the operation.
    match request {
        PtraceRequest::PeekData { .. } => {
            caller.cap_table.check(debug_cap, PermissionBits::READ)?;
        }
        PtraceRequest::PokeData { .. } => {
            caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
        }
        PtraceRequest::GetRegs => {
            caller.cap_table.check(debug_cap, PermissionBits::READ)?;
        }
        PtraceRequest::SetRegs { .. } => {
            caller.cap_table.check(debug_cap, PermissionBits::WRITE)?;
        }
        PtraceRequest::SingleStep => {
            caller.cap_table.check(debug_cap, PermissionBits::EXECUTE)?;
        }
        PtraceRequest::Syscall => {
            caller.cap_table.check(debug_cap, PermissionBits::SYSCALL_TRACE)?;
        }
        PtraceRequest::Attach | PtraceRequest::Seize => {
            // CAP_DEBUG alone is sufficient for attach.
        }
    }
    Ok(())
}

Namespace isolation: The CAP_DEBUG capability is scoped to the target's namespace. A debugger in namespace A cannot debug a process in namespace B unless it holds a CAP_DEBUG token that was explicitly delegated across the namespace boundary (using the standard capability delegation mechanism from Section 11). Cross-namespace debugging requires both a CAP_DEBUG on the target and a CAP_NS_TRAVERSE on every intermediate namespace. This makes container breakout via ptrace structurally impossible — there is no ambient authority to override.

seccomp interaction: When a debugger attaches to a seccomp-sandboxed process, the sandbox remains in effect. The debugger can observe syscalls (with PTRACE_SYSCALL) but cannot inject syscalls that the target's seccomp filter would deny. This prevents a class of attacks where a debugger is used to bypass seccomp restrictions.

Domain isolation interaction: As described in Section 3, ptrace reads/writes to domain-protected memory go through the kernel-mediated PKRU path. The debugger never gains direct access to the target's isolation domain. Capability checks happen before the kernel performs the PKRU switch.

40a.2 Hardware Debug Registers

Each architecture provides hardware breakpoint and watchpoint registers. ISLE exposes these through the ptrace interface, with debug register state saved and restored as part of ArchContext on every context switch.

Architecture  HW Breakpoints                 HW Watchpoints                 Mechanism
------------  -----------------------------  -----------------------------  ----------------------------------
x86-64        4 total (DR0-DR3), shared      shared with breakpoints        DR7 configures each of DR0-DR3 as
              with watchpoints in any                                       breakpoint or watchpoint; DR6
              combination                                                   status
AArch64       2-16 (implementation defined)  2-16 (implementation defined)  DBGBCR/DBGBVR, DBGWCR/DBGWVR
ARMv7         2-16 (implementation defined)  2-16 (implementation defined)  DBGBCR/DBGBVR via cp14
RISC-V        via trigger module             via trigger module             tselect, tdata1-tdata3 CSRs
PPC32         1 (IAC1)                       1-2 (DAC1, DAC2)               IAC/DAC SPRs, DBCR0/DBCR1 control
PPC64LE       1 (CIABR)                      1 (DAWR0); 2 on POWER10        CIABR, DAWR0/DAWRX0 SPRs
                                             (DAWR0/1)

// isle-core/src/arch/x86_64/context.rs (excerpt)

/// x86-64 debug register state, saved/restored on context switch.
#[repr(C)]
pub struct DebugRegState {
    /// Address breakpoint registers.
    pub dr0: u64,
    pub dr1: u64,
    pub dr2: u64,
    pub dr3: u64,
    /// Debug status register (read on debug exception, cleared after).
    pub dr6: u64,
    /// Debug control register (enables breakpoints, sets conditions).
    pub dr7: u64,
}
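
As a concrete illustration of the x86-64 control encoding, a hedged sketch of composing the DR7 fields (local-enable, condition, length) for one breakpoint slot; the helper name is hypothetical:

```rust
/// Condition field (R/Wn) in DR7: what access triggers slot n.
#[derive(Clone, Copy)]
enum BreakCondition {
    Execute = 0b00,
    Write = 0b01,
    ReadWrite = 0b11,
}

/// Encode the DR7 fields for hardware breakpoint slot `n` (0-3):
/// Ln (local enable) at bit 2n, R/Wn at bits 16+4n, LENn at bits 18+4n.
/// Length codes: 00 = 1 byte, 01 = 2, 11 = 4, 10 = 8 (long mode);
/// execute breakpoints must use length 1.
fn dr7_enable(dr7: u64, n: u32, cond: BreakCondition, len_code: u64) -> u64 {
    assert!(n < 4);
    dr7 | (1 << (2 * n))                  // Ln: local enable
        | ((cond as u64) << (16 + 4 * n)) // R/Wn: condition
        | (len_code << (18 + 4 * n))      // LENn: length
}

fn main() {
    // Execute breakpoint in slot 0: only L0 set (cond 00, len 00).
    assert_eq!(dr7_enable(0, 0, BreakCondition::Execute, 0b00), 0x1);
    // 4-byte write watchpoint in slot 1: L1 | R/W1 = 01 | LEN1 = 11.
    let v = dr7_enable(0, 1, BreakCondition::Write, 0b11);
    assert_eq!(v, (1 << 2) | (0b01 << 20) | (0b11 << 22));
}
```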
// isle-core/src/arch/aarch64/context.rs (excerpt)

/// AArch64 debug register state. The number of breakpoint/watchpoint
/// register pairs is discovered at boot via ID_AA64DFR0_EL1.
pub struct DebugRegState {
    /// Breakpoint control/value register pairs.
    pub bcr: [u32; MAX_HW_BREAKPOINTS],
    pub bvr: [u64; MAX_HW_BREAKPOINTS],
    /// Watchpoint control/value register pairs.
    pub wcr: [u32; MAX_HW_WATCHPOINTS],
    pub wvr: [u64; MAX_HW_WATCHPOINTS],
    /// Actual number of pairs available on this CPU.
    pub num_brps: u8,
    pub num_wrps: u8,
}

Debug register state is part of ArchContext (Section 12a.1) and is saved/restored on every context switch. When a hardware breakpoint or watchpoint fires, the CPU generates a debug exception: #DB on x86-64; a Breakpoint exception (EC=0x30/0x31) or Watchpoint exception (EC=0x34/0x35) on AArch64; a breakpoint exception on RISC-V. The exception handler checks whether the faulting thread is being ptraced. If so, the event is delivered to the debugger as a SIGTRAP with si_code set to TRAP_HWBKPT. If not, the signal is delivered directly to the process (default action: terminate with core dump).

40a.3 Core Dump Generation

On receipt of a fatal signal (SIGSEGV, SIGABRT, SIGFPE, SIGBUS, SIGILL, SIGSYS), the kernel generates an ELF core dump before terminating the process.

Contents of a core dump:

  • Register state (general-purpose, floating-point, vector, debug registers)
  • Memory mappings (VMA list with permissions, file backing, offsets)
  • Writable memory segments (stack, heap, anonymous mappings)
  • Signal information (siginfo_t for the fatal signal)
  • Auxiliary vector (AT_* entries)
  • Thread list with per-thread register state (for multi-threaded processes)

Capability gating: Core dump generation requires write access to the dump destination. The process's capability set must include WRITE on the target path (a filesystem location) or WRITE on the pipe to a handler program (configured via /proc/sys/kernel/core_pattern, same as Linux). If the process does not hold the required capability, no dump is written and the kernel logs a diagnostic message.

Core dump filter: A per-process bitmask controls which VMA types are included, compatible with Linux's /proc/pid/coredump_filter:

Bit  VMA Type             Default
---  -------------------  -------
0    Anonymous private    on
1    Anonymous shared     on
2    File-backed private  off
3    File-backed shared   off
4    ELF headers          on
5    Private huge pages   on
6    Shared huge pages    off
7    Private DAX pages    off
8    Shared DAX pages     off
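
The defaults above compose into the same 0x33 mask that Linux documents for coredump_filter. A small sketch (constant and helper names are illustrative):

```rust
/// Bits that default to "on" per the table: anonymous private (0),
/// anonymous shared (1), ELF headers (4), private huge pages (5).
const DEFAULT_ON_BITS: [u32; 4] = [0, 1, 4, 5];

fn default_filter() -> u32 {
    DEFAULT_ON_BITS.iter().fold(0u32, |m, &b| m | (1u32 << b))
}

fn main() {
    // Matches Linux's documented default coredump_filter of 0x33.
    assert_eq!(default_filter(), 0x33);
    // An admin enabling file-backed private mappings (bit 2), as via
    // /proc/<pid>/coredump_filter:
    assert_eq!(default_filter() | (1 << 2), 0x37);
}
```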

Compressed core dumps: When isle.coredump_compress=zstd is set (boot parameter or runtime sysctl), core dumps are compressed with zstd at level 3 before writing. For large processes (multi-GB heaps), this reduces dump size by 5-10x and reduces I/O time, making core dumps practical in production. The resulting file has a standard zstd frame that tools can decompress before loading into GDB.

40a.4 Kernel Debugging and Crash Dumps

Kernel panic handler: On kernel panic, the handler (Section 11a.3, Tier 0 code) captures a comprehensive snapshot of system state:

  1. Register state of all CPUs — The panicking CPU sends an IPI (or NMI on x86-64 if IPIs are not functioning) to all other CPUs. Each CPU saves its register state to a per-CPU crash save area and halts.
  2. Kernel stack — The faulting CPU's kernel stack is captured, with DWARF-based unwinding to produce a symbolic backtrace.
  3. Kernel log buffer — The most recent 64 KB of the kernel ring buffer (printk output) is included.
  4. Capability table state — A summary of the capability table (number of entries, recent grants/revocations) for post-mortem security analysis.
  5. Driver registry state — Status of all registered drivers, including tier, device bindings, and crash counts.

The panic handler writes all of this into the reserved crash region as an ELF core dump (see Section 9 for crash recovery and the NVMe polled write-out path).

kdump equivalent: For systems that require maximum crash dump reliability, ISLE supports reserving a crash kernel memory region at boot (isle.crashkernel=256M). On panic, the system kexecs into a pre-loaded crash kernel, which boots into a minimal environment with a single purpose: write the dump to persistent storage and reboot. The crash kernel is stripped to the minimum — serial driver (Tier 0), block driver (Tier 0, polled mode), ELF writer, and nothing else. No scheduler, no interrupts, no capability system.
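A sketch of parsing the size argument of isle.crashkernel= (helper name hypothetical; real boot-parameter handling would also cover offsets and ranges):

```rust
/// Parse a boot-parameter size like "256M" or "1G" into bytes.
/// Minimal sketch; accepts K/M/G suffixes (case-insensitive) or a
/// bare byte count.
pub fn parse_crashkernel(arg: &str) -> Option<u64> {
    let (num, shift) = match arg.as_bytes().last().copied()? {
        b'K' | b'k' => (&arg[..arg.len() - 1], 10),
        b'M' | b'm' => (&arg[..arg.len() - 1], 20),
        b'G' | b'g' => (&arg[..arg.len() - 1], 30),
        _ => (arg, 0),
    };
    num.parse::<u64>().ok().map(|n| n << shift)
}
```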

GDB remote stub: Development builds can enable a built-in GDB remote protocol stub (isle.gdb=serial or isle.gdb=net). This provides full kernel debugging over a serial port or UDP connection:

  • Set breakpoints in kernel code (software breakpoints via int3 / brk / ebreak)
  • Single-step through kernel execution paths
  • Read and write kernel memory and registers
  • Inspect per-CPU state, thread lists, and capability tables
  • Attach to a running kernel or halt at boot (isle.gdb_wait=1)

The GDB stub is compiled out of release builds (#[cfg(feature = "gdb-stub")]). It is never present in production kernels — this is a development-only facility.
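For reference, the GDB remote serial protocol the stub speaks frames each packet as $<payload>#<checksum>, where the checksum is the modulo-256 sum of the payload bytes rendered as two hex digits. A minimal framing sketch (escaping of special bytes like # and $ is omitted; this is tool-side illustration, not the stub's actual code):

```rust
/// Frame a GDB remote-serial-protocol packet: "$<payload>#<checksum>".
/// The checksum is the sum of payload bytes modulo 256, as two hex digits.
pub fn gdb_frame(payload: &str) -> String {
    let cs: u8 = payload.bytes().fold(0u8, |acc, b| acc.wrapping_add(b));
    format!("${}#{:02x}", payload, cs)
}
```

For example, the "read all registers" request `g` frames as `$g#67`.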

Driver crash debugging: When a Tier 1 driver crashes, the fault handler (Section 39) captures the crash context before initiating recovery. The captured state includes:

  • Driver thread register state (all register classes)
  • Isolation domain state (PKRU value, domain assignment)
  • Driver-private memory snapshot (pages in the driver's isolation domain, up to a configurable limit, default 4 MB)
  • Recent ring buffer entries from the driver's communication channels
  • IOMMU mapping state for the driver's devices

This context is written to:

/sys/kernel/isle/drivers/{name}/crash_dump

The file persists until the next driver load or until explicitly cleared. Tools like isle-crashdump or GDB (with ISLE-aware scripts) can parse the dump for root cause analysis without reproducing the crash.

40a.5 /proc/<pid> Interface

ISLE provides compatibility with the Linux /proc/<pid> interface that debuggers, profilers, and monitoring tools depend on. Each entry is capability-checked individually.

Path                         Content                                                        Capability required
/proc/<pid>/maps             Memory mappings (address, perms, offset, device, inode, path)  CAP_DEBUG or same-process
/proc/<pid>/mem              Process memory (seek + read/write)                             CAP_DEBUG + READ/WRITE
/proc/<pid>/status           Task state, memory usage, capability summary                   None (public fields) or CAP_DEBUG (private fields)
/proc/<pid>/stack            Kernel stack trace                                             CAP_DEBUG + KERNEL_READ
/proc/<pid>/syscall          Current syscall number and arguments                           CAP_DEBUG
/proc/<pid>/wchan            Wait channel (function name where task is sleeping)            None
/proc/<pid>/coredump_filter  Core dump VMA filter bitmask                                   Same-process or CAP_DEBUG

Per-access capability checking: Unlike Linux, where /proc/pid/mem is checked at open() time and then freely readable, ISLE checks capabilities on every read() and write() call. This eliminates TOCTOU vulnerabilities where a capability is revoked between open and access — a revoked CAP_DEBUG takes effect immediately, even on already-open file descriptors.

// isle-compat/src/procfs/mem.rs

/// Read handler for /proc/pid/mem.
/// Capability is checked on every read, not just on open.
fn proc_pid_mem_read(
    file: &ProcFile,
    buf: &mut [u8],
    offset: u64,
) -> Result<usize, IoError> {
    let caller = current_process();
    let target = file.target_process();

    // Re-check capability on every access (not cached from open).
    if caller.pid() != target.pid() {
        caller.cap_table.lookup(
            target.object_id(),
            PermissionBits::DEBUG | PermissionBits::READ,
        ).map_err(|_| IoError::PermissionDenied)?;
    }

    // Perform the read through the kernel's memory access path.
    // For domain-protected memory, this goes through the PKRU
    // mediation path (Section 3).
    target.address_space().read_remote(offset, buf)
}

Public vs. private fields in /proc/pid/status: Fields like State, Pid, PPid, Uid, Gid, and VmSize are considered public and readable without CAP_DEBUG (same as Linux). Fields like VmPeak, VmData, VmStk, CapInh, CapPrm, CapEff, Seccomp, and voluntary_ctxt_switches are private and require CAP_DEBUG on the target. This prevents information leakage that could aid side-channel or timing attacks while preserving compatibility with tools like ps and top that only read public fields.
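The field-visibility rule can be sketched as a pure predicate (hypothetical helper; in the kernel, has_cap_debug would come from a capability-table lookup on the target, as in proc_pid_mem_read above):

```rust
/// Which /proc/<pid>/status fields a reader may see. Private fields
/// require either being the same process or holding CAP_DEBUG on the
/// target; public fields are always visible.
pub fn status_field_visible(field: &str, same_process: bool, has_cap_debug: bool) -> bool {
    // Public fields, per the text above (same as Linux's public set).
    const PUBLIC: &[&str] = &["State", "Pid", "PPid", "Uid", "Gid", "VmSize"];
    same_process || has_cap_debug || PUBLIC.contains(&field)
}
```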


41. Unified Object Namespace

Inspired by: Windows NT Object Manager, Plan 9 namespace concepts. IP status: Clean — basic OS design concept from Multics (1960s), any NT patents expired (filed 1989-1993, expired 2009-2013).

41.1 Problem

Linux organizes kernel resources through multiple unrelated mechanisms:

  • Files/sockets/pipes → file descriptors (integer indices into per-process table)
  • Processes → PIDs (global integer namespace)
  • Signals → signal numbers (per-process bitmask)
  • IPC → System V IPC keys, POSIX named semaphores, futex addresses
  • Devices → /dev nodes (major/minor numbers)
  • Kernel tunables → /proc/sys (sysctl)
  • Device tree → /sys (sysfs)
  • Timer resources → timerfd, POSIX timers (separate handle space)
  • Event notification → eventfd, epoll (yet another handle space)

There is no unified way to enumerate "all kernel resources held by process X" or "all kernel resources related to device Y." Each subsystem has its own introspection mechanism (or none at all).

ISLE already has a capability system (Section 11) where every resource is accessed through capability tokens. The object namespace makes this explicit and queryable — a hierarchical tree where every kernel object has a canonical path and uniform access control.

41.2 Design: Kernel-Internal Object Tree

// isle-core/src/namespace/mod.rs (kernel-internal)

/// Every kernel resource is an Object.
pub struct Object {
    /// Unique object ID (monotonically increasing, never reused).
    pub id: ObjectId,

    /// Object type (from existing cap/mod.rs ObjectType).
    pub object_type: ObjectType,

    /// Reference count.
    pub refcount: AtomicU32,

    /// Capability security descriptor.
    pub security: SecurityDescriptor,

    /// Type-specific data (union).
    pub data: ObjectData,
}

/// The namespace tree.
pub struct ObjectNamespace {
    /// Root directory of the namespace.
    root: ObjectDirectory,

    /// Sparse index for O(1) average-case lookup by ObjectId.
    ///
    /// ObjectIds are monotonically increasing and never reused, so a dense
    /// array indexed by ObjectId would grow without bound as objects are
    /// created and destroyed over the system's lifetime (e.g., a long-running
    /// server creates millions of short-lived process objects). A hash map
    /// provides O(1) average lookup while using memory proportional to the
    /// number of *live* objects, not to the total number ever allocated.
    ///
    /// The key is the ObjectId's numeric value. Entries are inserted on
    /// registration and removed when the object's refcount reaches zero.
    // Note: `HashMap` and `Vec` in these definitions are kernel-internal equivalents
    // (slab-backed hash table and bounded array, respectively), not `std` types. The
    // kernel uses `SlabHashMap` (slab-allocated, interrupt-safe, with RCU-protected
    // lookup) and `BoundedVec` (capacity-limited, slab-allocated). The `std`-style
    // type names are used here for readability.
    index: HashMap<u64, *mut Object>,
}

/// A directory in the namespace — contains named references to objects.
pub struct ObjectDirectory {
    /// This directory's object identity.
    object: Object,

    /// Named entries (sorted for binary search). For directories with fewer than
    /// 4096 entries, a sorted Vec provides cache-friendly binary search and compact
    /// memory layout. Directories exceeding 4096 entries (e.g., /Processes on busy
    /// servers) switch to a BTreeMap internally for O(log n) insertion, avoiding
    /// the O(n) element shift cost of Vec insertion at scale.
    entries: Vec<(ArrayString<64>, ObjectEntry)>,
}

pub enum ObjectEntry {
    /// Direct reference to an object.
    Object(ObjectId),
    /// Subdirectory.
    ///
    /// `Box` is required here: `ObjectEntry` is stored inside `ObjectDirectory`,
    /// which is stored inside `ObjectEntry::Directory`. Without indirection, the
    /// type would be infinitely sized and Rust would reject it at compile time.
    Directory(Box<ObjectDirectory>),
    /// Symbolic link to another path in the namespace.
    Symlink(ArrayString<256>),
}
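Path resolution over this tree is a straightforward component walk. The sketch below uses std collections for readability (the kernel uses the slab-backed equivalents noted above) and omits symlink handling; demo_tree builds a small example with one device object.

```rust
use std::collections::HashMap;

/// Simplified stand-in for ObjectDirectory/ObjectEntry, for illustration.
pub enum Node {
    Object(u64), // ObjectId
    Directory(HashMap<String, Node>),
}

/// Resolve an absolute path like "/Devices/pci0000:00/0000:03:00.0"
/// to an ObjectId by walking directory components.
pub fn resolve(root: &Node, path: &str) -> Option<u64> {
    let mut cur = root;
    for comp in path.split('/').filter(|c| !c.is_empty()) {
        match cur {
            Node::Directory(entries) => cur = entries.get(comp)?,
            Node::Object(_) => return None, // components past a leaf object
        }
    }
    match cur {
        Node::Object(id) => Some(*id),
        Node::Directory(_) => None, // path names a directory, not an object
    }
}

/// Example fixture: a single NVMe device object under /Devices.
pub fn demo_tree() -> Node {
    let mut pci = HashMap::new();
    pci.insert("0000:03:00.0".to_string(), Node::Object(4217));
    let mut devices = HashMap::new();
    devices.insert("pci0000:00".to_string(), Node::Directory(pci));
    let mut root = HashMap::new();
    root.insert("Devices".to_string(), Node::Directory(devices));
    Node::Directory(root)
}
```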

41.3 Namespace Layout

/                                       (root)
+-- Devices                             (device registry mirror)
|   +-- pci0000:00
|   |   +-- 0000:00:1f.2               (SATA controller)
|   |   +-- 0000:03:00.0               (NVMe)
|   |   +-- 0000:04:00.0               (NIC)
|   +-- usb1
|   |   +-- usb1-1                      (hub)
|   |       +-- usb1-1.1               (keyboard)
|
+-- Drivers                             (loaded driver instances)
|   +-- isle-nvme                       (driver object)
|   +-- isle-e1000                      (driver object)
|   +-- isle-xhci                       (driver object)
|
+-- Processes                           (process objects)
|   +-- 1                               (init/systemd)
|   |   +-- Threads
|   |   |   +-- 1                       (main thread)
|   |   |   +-- 2                       (worker thread)
|   |   +-- Handles                     (fd table: capabilities)
|   |   |   +-- 0                       (stdin - pipe)
|   |   |   +-- 1                       (stdout - tty)
|   |   |   +-- 3                       (socket)
|   |   +-- Memory                      (VMA tree)
|   |   +-- Capabilities                (capability set)
|   +-- 42                              (some user process)
|
+-- Memory                              (physical memory regions)
|   +-- Node0                           (NUMA node 0)
|   +-- Node1                           (NUMA node 1)
|
+-- Network                             (network stack objects)
|   +-- Interfaces
|   |   +-- eth0                        (NIC)
|   |   +-- lo                          (loopback)
|   +-- Sockets                         (open sockets)
|
+-- IPC                                 (IPC endpoints)
|   +-- Pipes
|   +-- SharedMemory
|   +-- Semaphores
|
+-- Security                            (security policy objects)
|   +-- Capabilities                    (capability type registry)
|   +-- LSM                             (security module state)
|
+-- Health                              (FMA — Section 39)
|   +-- ByDevice
|   |   +-- 0000:03:00.0               (NVMe health)
|   +-- RetiredPages
|   +-- DiagnosisRules
|
+-- Scheduler                           (scheduler state)
|   +-- RunQueues
|   +-- CbsServers                      (Section 15)
|
+-- Tracing                             (Section 40)
    +-- StableTracepoints
    +-- AggregationMaps

41.4 What The Namespace Provides

Uniform enumeration: "Show me everything related to device 0000:03:00.0" is a namespace traversal starting at /Devices/pci0000:00/0000:03:00.0, following links to its driver instance (/Drivers/isle-nvme), its health data (/Health/ByDevice/0000:03:00.0), and any processes with open handles to it.

Uniform security: Every object in the namespace has a SecurityDescriptor that ties into the capability system. Access checks are uniform regardless of object type.

Uniform lifecycle: Objects are reference-counted. When refcount hits zero, the type-specific destructor runs. No per-subsystem cleanup code — the namespace manages object lifetime uniformly.

Cross-reference discovery: "What processes have handles to this device?" is a query across /Processes/*/Handles/*, filtering by target ObjectId. Without the namespace, this requires per-subsystem ad-hoc code.
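The handle-holder query can be sketched as a flat scan over per-process handle tables. The std types and the demo_tables fixture below are illustrative only; the real query walks /Processes/*/Handles/* in the namespace.

```rust
use std::collections::HashMap;

/// "Which processes hold handles to this object?" as a query over
/// per-process handle tables mapping pid -> [(fd, target ObjectId)].
/// Returns (pid, fd) pairs referencing `target`, sorted for stable output.
pub fn holders(handle_tables: &HashMap<u32, Vec<(u32, u64)>>, target: u64) -> Vec<(u32, u32)> {
    let mut out: Vec<(u32, u32)> = handle_tables
        .iter()
        .flat_map(|(pid, fds)| {
            fds.iter()
                .filter(move |(_, oid)| *oid == target)
                .map(move |(fd, _)| (*pid, *fd))
        })
        .collect();
    out.sort();
    out
}

/// Example fixture: two processes hold handles to object 4217.
pub fn demo_tables() -> HashMap<u32, Vec<(u32, u64)>> {
    HashMap::from([
        (1, vec![(7, 4217)]),
        (834, vec![(12, 4217), (3, 99)]),
        (900, vec![(5, 99)]),
    ])
}
```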

41.5 Registration Strategy: Eager vs Lazy

Not all kernel objects are registered with equal urgency. High-frequency objects would add unacceptable overhead if registered on every creation:

Eagerly registered (created infrequently, high value for introspection):

  • Devices, drivers, processes, NUMA nodes, cgroups, IPC endpoints, security policies.
  • Registered at creation time, deregistered at destruction.

Lazily registered (created/destroyed at high frequency):

  • File descriptors, sockets, VMAs (virtual memory areas), anonymous pages.
  • The namespace entry is created on first query, not on creation. The namespace maintains a per-process "generation counter"; when a query finds a stale generation, it re-syncs from the kernel's authoritative data structures (fd table, VMA tree).
  • This means /proc/<pid>/isle/objects may have brief inconsistencies (a just-closed fd might still appear), but the hot path (open/close/read/write) has zero overhead.
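The generation-counter handshake might look like the following sketch (names illustrative, not the kernel's actual types; u64::MAX marks a view that has never synced):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Lazy-registration sketch: the fd table bumps a generation counter on
/// every mutation; the namespace view caches the generation it last
/// synced at and re-syncs only when a query observes staleness.
pub struct LazyView {
    synced_gen: AtomicU64,
}

impl LazyView {
    pub fn new() -> Self {
        // u64::MAX = never synced, so the first query always re-syncs.
        LazyView { synced_gen: AtomicU64::new(u64::MAX) }
    }

    /// Returns true if this query had to re-sync from the authoritative
    /// fd table (i.e. the cached view was stale).
    pub fn query(&self, table_gen: &AtomicU64) -> bool {
        let current = table_gen.load(Ordering::Acquire);
        if self.synced_gen.load(Ordering::Acquire) != current {
            // ... rebuild namespace entries from the fd table here ...
            self.synced_gen.store(current, Ordering::Release);
            true
        } else {
            false
        }
    }
}

/// Walk through one lifecycle: first query syncs, a repeat query hits
/// the cache, and a table mutation forces a re-sync.
pub fn demo() -> (bool, bool, bool) {
    let table_gen = AtomicU64::new(0);
    let view = LazyView::new();
    let first = view.query(&table_gen);        // never synced: re-sync
    let cached = view.query(&table_gen);       // unchanged: cached
    table_gen.fetch_add(1, Ordering::Release); // an fd was opened/closed
    let stale = view.query(&table_gen);        // stale: re-sync
    (first, cached, stale)
}
```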

Memory budget per eagerly-registered object:

Component                      Per object   Notes
Object struct (fixed fields)   ~64 bytes    ID, type, refcount, security
Namespace entry (name + link)  ~80 bytes    ArrayString<64> + pointer
SlabHashMap index entry        ~16 bytes    Pointer + occupancy bit
Total per object               ~160 bytes

Typical system: ~2000 eagerly registered objects (devices + drivers + processes) = ~320 KB baseline. File descriptors are lazy, so they don't add to this baseline.

41.6 Linux Interface Exposure — Standard Mechanisms

The namespace is kernel-internal. Linux applications never see it. But ISLE-specific tools can access it through standard Linux interfaces as additive extensions:

Via procfs (new entries under /proc/isle/, additive):

/proc/isle/objects/
    summary             # Total object count by type
    by_type/
        Device          # List of all device objects
        Process         # List of all process objects
        FileDescriptor  # List of all open FDs system-wide
        Socket          # List of all sockets
        Capability      # List of all capability grants
    by_id/
        <object_id>     # Full details of a specific object

/proc/isle/namespace/
    tree                # Full namespace tree (text dump, similar to NT WinObj)
    resolve/<path>      # Resolve a namespace path to an object

/proc/<pid>/isle/
    capabilities        # Full capability set for this process
    objects             # All objects this process holds handles to
    namespace_view      # This process's view of the namespace

Via sysfs (additive attributes on existing device nodes):

/sys/devices/.../isle_object_id     # Object ID in the namespace
/sys/devices/.../isle_refcount      # Current reference count
/sys/devices/.../isle_capabilities  # Capabilities granted for this device

Via a pseudo-filesystem (islefs, mountable):

mount -t islefs none /mnt/isle

# Now the object namespace is browsable as a filesystem:
ls /mnt/isle/Devices/pci0000:00/
cat /mnt/isle/Processes/42/Capabilities
cat /mnt/isle/Health/ByDevice/0000:03:00.0/status

This is the same pattern as Linux's debugfs, tracefs, and configfs — a pseudo-filesystem for kernel introspection. No new syscalls. Standard open/read/readdir. Any Linux tool that can read files can introspect the namespace.

41.7 islefs Detail

// isle-compat/src/islefs/mod.rs

/// islefs: pseudo-filesystem exposing the object namespace.
/// Mounted via: mount -t islefs none /mountpoint
///
/// Read-only by default. Write access (for admin operations like
/// forcing a driver reload or revoking a capability) requires
/// CAP_SYS_ADMIN (mapped to ISLE's admin capability set).
pub struct IsleFs {
    /// Reference to the kernel's object namespace.
    namespace: &'static ObjectNamespace,

    /// Mount options.
    options: IsleFsMountOptions,
}

pub struct IsleFsMountOptions {
    /// Show only objects matching this type filter.
    pub type_filter: Option<ObjectType>,

    /// Show only objects accessible to this UID.
    pub uid_filter: Option<u32>,

    /// Maximum depth of directory listing (avoid huge listings).
    pub max_depth: u32,

    /// Enable write operations (default: false).
    pub writable: bool,
}

islefs file format for object details:

$ cat /mnt/isle/Devices/pci0000:00/0000:03:00.0
type: Device
id: 4217
refcount: 3
state: Active
driver: isle-nvme
tier: 1
bus: pci
vendor: 0x144d
device: 0xa808
class: 0x010802
numa_node: 0
health: ok
power: D0Active
capabilities_granted: 2
  cap[0]: DMA_ACCESS (perms: READ|WRITE)
  cap[1]: INTERRUPT (perms: MANAGE_IRQ)
handles_held_by:
  process 1 (systemd): fd 7
  process 834 (postgres): fd 12
  process 835 (postgres): fd 13

41.8 Admin Operations via islefs (write)

With writable mount option and admin capabilities:

# Force a driver reload (triggers crash recovery path on a healthy driver)
echo "reload" > /mnt/isle/Drivers/isle-nvme/control

# Revoke a specific capability
echo "revoke 4217:0" > /mnt/isle/Security/Capabilities/control

# Disable a device (sets to Error state in registry)
echo "disable" > /mnt/isle/Devices/pci0000:00/0000:04:00.0/control

# Change FMA diagnosis rule threshold
echo "threshold 200" > /mnt/isle/Health/DiagnosisRules/dimm_degradation/control

# Force tier demotion
echo "demote 2" > /mnt/isle/Drivers/isle-e1000/tier

These are all standard file write operations. Any shell script, Ansible playbook, or management tool can use them. No custom CLI tools are required.

41.9 How Subsystems Register Objects

Each kernel subsystem registers its objects with the namespace during initialization:

// Example: Device registry registers each device node.
// In isle-core/src/registry/mod.rs:

fn register_device_node(&mut self, node: &DeviceNode) {
    // Note: `format!()` here represents `ArrayString::from_fmt()` (stack-allocated,
    // fixed-capacity formatting). The kernel does not use heap-allocated `String`.
    let path = format!("/Devices/{}", node.sysfs_path());
    self.namespace.register(path, Object::from_device(node));
}

// Example: Process creation registers process object.
// In isle-core/src/sched/process.rs:

fn create_process(&mut self, ...) -> Process {
    let proc = Process::new(...);
    // Note: `format!()` here represents `ArrayString::from_fmt()` (stack-allocated,
    // fixed-capacity formatting). The kernel does not use heap-allocated `String`.
    let path = format!("/Processes/{}", proc.pid);
    self.namespace.register(path, Object::from_process(&proc));
    proc
}

// Objects are automatically deregistered when they are destroyed
// (refcount -> 0 triggers namespace removal).

41.10 Relationship to Existing Interfaces

The namespace does NOT replace /proc, /sys, or /dev. Those remain as the Linux-compatible interfaces that existing tools depend on. The namespace is an additional unified view:

Interface     Audience                         Purpose                        Status
/proc         Linux tools (ps, top, etc.)      Process info, kernel stats     Required for compat
/sys          Linux tools (udev, lspci, etc.)  Device tree, attributes        Required for compat
/dev          Linux tools (everything)         Device access                  Required for compat
/proc/isle/*  ISLE-aware admin tools           Object namespace queries       New, additive
islefs        ISLE-aware admin tools           Full namespace browse/control  New, additive, optional

The namespace is the kernel-internal source of truth. /proc, /sys, and /dev are generated from it (just as /sys is generated from the device registry, and /proc from process state). The difference is that the namespace provides a unified view where cross-subsystem queries are natural.

41.11 Unified Management CLI (islectl)

Multiple kernel subsystems expose sysfs/procfs interfaces that administrators interact with through different tools (isle-mltool, veritysetup, sysctl, direct sysfs writes). islectl provides a single-entry-point CLI for ISLE system administration.

Design principle: islectl is a userspace tool that reads from the Unified Object Namespace (§41) and writes to sysfs/procfs. It is a convenience layer, not a privileged daemon — every operation it performs is also possible via direct sysfs writes or existing tools. This ensures the kernel never depends on islectl and that scripting/automation can bypass it entirely.

Subcommands:

Subcommand                               Description                Kernel sections
islectl device list|info|health          Device registry queries    §7
islectl driver load|unload|tier          Driver management          §5, §6
islectl policy list|swap|compare         Policy module management   §51
islectl intent set|status|explain        Intent management          §53
islectl fma rules|events|status          Fault management queries   §39
islectl evolve status|apply|rollback     Live evolution management  §52
islectl cluster join|leave|status|nodes  Cluster orchestration      §47
islectl power budget|status              Power budgeting            §49

Output format: JSON by default (machine-parseable for automation pipelines). The --human flag renders formatted tables for interactive use. --watch streams updates in near-real-time by polling sysfs, using notification-driven wakeups where an attribute supports poll().

Cluster mode: When run on a cluster node, subcommands accept a --cluster flag to operate on the cluster fabric via distributed IPC (§47). islectl cluster status shows all nodes. islectl --node=N device list queries a specific remote node. Cluster operations are strictly read-only unless explicitly confirmed with --yes (to prevent accidental cross-node mutations).

Implementation phases:

  • Phase 3: Basic device, driver, and policy commands
  • Phase 4: FMA, intent, and evolution management
  • Phase 5: Cluster commands and cross-node operations