Sections 34–36 of the ISLE Architecture. For the full table of contents, see README.md.
Part VIII: Networking
TCP extensibility, network overlays, and interface naming.
34. TCP Stack Extensibility
Linux problem: MPTCP took many years to get into mainline because it required deep
changes to the TCP stack. The monolithic TCP implementation made it hard to add new
transport protocols. Congestion control algorithms are pluggable, but the socket layer
itself is tightly coupled to TCP internals — adding a fundamentally new transport like
QUIC kernel offload requires invasive surgery across net/ipv4/, net/ipv6/, and the
socket layer.
ISLE design:
34.1 Network Stack Architecture
isle-net is the Tier 1 network stack. It runs in its own isolation domain, separate from isle-core. The kernel never executes protocol processing directly — all network I/O crosses the domain boundary via ring buffers (~23 cycles per crossing, Section 3).
The stack is layered, with each layer communicating through well-defined internal interfaces:
Application (userspace)
| syscall (socket, bind, listen, accept, read, write, sendmsg, recvmsg)
v
isle-core: socket dispatch (translates fd ops to isle-net ring commands)
| domain ring buffer (~23 cycles)
v
isle-net (Tier 1):
Socket layer (protocol-agnostic)
|
Transport layer (TCP, UDP, SCTP, MPTCP)
|
Network layer (IPv4, IPv6, routing, netfilter)
|
Link layer (ARP, NDP, bridge, VLAN)
| domain ring buffer (~23 cycles)
v
NIC driver (Tier 1): device-specific TX/RX
The four domain switches (two domain entries — NIC driver and isle-net — each requiring an
enter and exit switch) add ~92 cycles total to the data path (4 × ~23 cycles; see
Section 34.5 for detailed analysis). For comparison, a single
sendmsg() syscall in Linux costs ~700-1800 cycles in syscall transition overhead on modern hardware with Spectre/Meltdown mitigations enabled (pre-mitigation: ~200-400 cycles)
(the SYSCALL/SYSRET ring crossing, not a full Linux process context switch which costs
5,000-20,000 cycles due to TLB flushes, cache pollution, and scheduler overhead).
The domain boundary is cheaper than even a bare syscall transition.
Linux comparison: Linux's network stack is monolithic — TCP, IP, Netfilter, and the socket layer all execute in the same address space with no isolation. A buffer overflow in a Netfilter module can corrupt TCP connection state. In ISLE, a bug in the VXLAN tunnel parser cannot corrupt the TCP congestion window of an unrelated connection: hardware domain isolation enforces the boundary between isle-net and the NIC driver (and between isle-net and isle-core), while within isle-net, Rust's ownership model and memory safety provide the intra-module separation.
34.2 Socket Abstraction
The socket layer is protocol-agnostic. Transport protocols register implementations of a common trait:
/// Protocol-agnostic socket operations.
/// Each transport protocol (TCP, UDP, SCTP, MPTCP) implements this trait.
pub trait SocketOps: Send + Sync {
    /// Bind the socket to a local address.
    fn bind(&self, addr: &SockAddr) -> Result<(), KernelError>;
    /// Mark the socket as a passive listener.
    fn listen(&self, backlog: u32) -> Result<(), KernelError>;
    /// Accept an incoming connection (blocking or non-blocking).
    fn accept(&self) -> Result<(SlabRef<dyn SocketOps>, SockAddr), KernelError>;
    /// Initiate an outgoing connection.
    fn connect(&self, addr: &SockAddr) -> Result<(), KernelError>;
    /// Send a message (scatter-gather, ancillary data, destination address).
    fn sendmsg(&self, msg: &MsgHdr, flags: u32) -> Result<usize, KernelError>;
    /// Receive a message (scatter-gather, ancillary data, source address).
    fn recvmsg(&self, msg: &mut MsgHdr, flags: u32) -> Result<usize, KernelError>;
    /// Set a socket option (protocol-specific behavior).
    fn setsockopt(&self, level: i32, name: i32, val: &[u8]) -> Result<(), KernelError>;
    /// Get a socket option value.
    fn getsockopt(&self, level: i32, name: i32, buf: &mut [u8]) -> Result<usize, KernelError>;
    /// Retrieves the local address of a bound socket.
    /// Returns the address in `addr` and its length.
    fn getsockname(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;
    /// Retrieves the remote address of a connected socket.
    /// Returns the address in `addr` and its length.
    fn getpeername(&self, addr: &mut SockAddr) -> Result<usize, KernelError>;
    /// Poll for readiness events (POLLIN, POLLOUT, POLLERR, POLLHUP).
    fn poll(&self, events: PollEvents) -> PollEvents;
    /// Shut down part of a full-duplex connection.
    fn shutdown(&self, how: ShutdownHow) -> Result<(), KernelError>;
    /// Close the socket and release all resources.
    /// For TCP: initiates FIN handshake (or RST if SO_LINGER with timeout 0).
    /// For UDP: releases port binding and queued buffers.
    /// Called when the last file descriptor reference is dropped (VFS layer
    /// guarantees exactly one call per socket lifetime). Error recovery paths
    /// do NOT call close() directly — they mark the socket as errored and let
    /// the VFS drop path handle cleanup.
    ///
    /// Close is **best-effort**: if close() returns Err (e.g., TCP FIN
    /// handshake timeout), the error is logged and the socket is released
    /// regardless. The VFS layer always frees the socket resources after
    /// this call, matching Linux semantics where close(2) errors on
    /// sockets are not retryable.
    fn close(&self) -> Result<(), KernelError>;
}
/// Socket address structure. Matches Linux's `struct sockaddr_storage` (128 bytes)
/// to accommodate all address families (AF_INET, AF_INET6, AF_UNIX, etc.).
/// The `family` field discriminates the actual address type.
#[repr(C)]
pub struct SockAddr {
    /// Address family (AF_INET, AF_INET6, AF_UNIX, etc.).
    pub family: u16,
    /// Address data. Interpretation depends on family:
    /// - AF_INET: bytes [2..8] = struct sockaddr_in (port, addr, padding)
    /// - AF_INET6: bytes [2..28] = struct sockaddr_in6 (port, flowinfo, addr, scope_id)
    /// - AF_UNIX: bytes [2..108] = struct sockaddr_un (path)
    pub data: [u8; 126],
}
/// Socket factory trait. Each protocol registers a factory during initialization.
/// The factory creates socket instances when `socket()` syscall is invoked.
pub trait SocketFactory: Send + Sync {
    /// Create a new socket instance.
    /// `family` is AF_INET or AF_INET6. `sock_type` is SOCK_STREAM, SOCK_DGRAM, etc.
    /// `protocol` is IPPROTO_TCP, IPPROTO_UDP, etc. (or 0 for default).
    fn create_socket(
        &self,
        family: AddressFamily,
        sock_type: SocketType,
        protocol: u16,
    ) -> Result<SlabRef<dyn SocketOps>, KernelError>;
}
/// Address family constants (matches Linux AF_* values).
#[repr(u16)]
pub enum AddressFamily {
    Unspec = 0,   // AF_UNSPEC
    Unix = 1,     // AF_UNIX / AF_LOCAL
    Inet = 2,     // AF_INET (IPv4)
    Inet6 = 10,   // AF_INET6 (IPv6)
    Netlink = 16, // AF_NETLINK
    // ... other families as needed
}
/// Socket type constants (matches Linux SOCK_* values).
#[repr(u32)]
pub enum SocketType {
    Stream = 1,    // SOCK_STREAM (TCP)
    Dgram = 2,     // SOCK_DGRAM (UDP)
    Raw = 3,       // SOCK_RAW
    Seqpacket = 5, // SOCK_SEQPACKET (SCTP)
    // ... other types as needed
}
/// Poll event bitflags (matches Linux POLL* values).
#[repr(transparent)]
pub struct PollEvents(u16);
impl PollEvents {
    pub const IN: Self = Self(0x0001);   // POLLIN
    pub const PRI: Self = Self(0x0002);  // POLLPRI
    pub const OUT: Self = Self(0x0004);  // POLLOUT
    pub const ERR: Self = Self(0x0008);  // POLLERR
    pub const HUP: Self = Self(0x0010);  // POLLHUP
    pub const NVAL: Self = Self(0x0020); // POLLNVAL
}
/// Shutdown direction (matches Linux SHUT_* values).
#[repr(i32)]
pub enum ShutdownHow {
    Rd = 0,   // SHUT_RD
    Wr = 1,   // SHUT_WR
    RdWr = 2, // SHUT_RDWR
}
/// Message header for sendmsg/recvmsg (matches Linux struct msghdr).
#[repr(C)]
pub struct MsgHdr {
    /// Optional destination address (for connectionless sockets).
    pub msg_name: *mut SockAddr,
    pub msg_namelen: u32,
    /// Scatter-gather array (iovec).
    pub msg_iov: *mut IoVec,
    pub msg_iovlen: usize,
    /// Ancillary data (control messages).
    pub msg_control: *mut u8,
    pub msg_controllen: usize,
    /// Flags.
    pub msg_flags: i32,
}
/// I/O vector for scatter-gather I/O (matches Linux struct iovec).
#[repr(C)]
pub struct IoVec {
    pub iov_base: *mut u8,
    pub iov_len: usize,
}
/// Connection tracking state (matches Linux conntrack states).
#[repr(u8)]
pub enum ConntrackState {
    New = 0,
    Established = 1,
    Related = 2,
    Invalid = 3,
    Untracked = 4,
}
/// NAT type applied to a connection.
#[repr(u8)]
pub enum NatType {
    None = 0,
    Snat = 1,       // Source NAT
    Dnat = 2,       // Destination NAT
    Masquerade = 3, // Source NAT with auto-IP
}
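The PollEvents newtype above is a flag set, but the listing omits the bitwise operators that poll() callers need to combine and test events. A minimal userspace sketch (the operator impls and the contains helper are assumptions for illustration, not part of the quoted ISLE interface):

```rust
use std::ops::{BitAnd, BitOr};

/// Mirror of the PollEvents newtype from the listing above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct PollEvents(pub u16);

impl PollEvents {
    pub const IN: Self = Self(0x0001);  // POLLIN
    pub const OUT: Self = Self(0x0004); // POLLOUT
    pub const ERR: Self = Self(0x0008); // POLLERR

    /// True if every event bit in `other` is set in `self`.
    pub fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

impl BitOr for PollEvents {
    type Output = Self;
    fn bitor(self, rhs: Self) -> Self { Self(self.0 | rhs.0) }
}

impl BitAnd for PollEvents {
    type Output = Self;
    fn bitand(self, rhs: Self) -> Self { Self(self.0 & rhs.0) }
}

fn main() {
    let ready = PollEvents::IN | PollEvents::OUT;
    assert!(ready.contains(PollEvents::IN));
    assert!(!ready.contains(PollEvents::ERR));
}
```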
Socket concurrency model: The socket dispatch layer wraps each dyn SocketOps
in a per-socket RwLock<SlabRef<dyn SocketOps>>. The RwLock serializes socket
lifecycle operations (close, which drops the SlabRef) against concurrent data
operations (sendmsg, recvmsg). Socket-internal state uses fine-grained interior
mutability (per-field atomics or internal locks), so the outer RwLock is never
contended on the data path -- readers acquire the read lock for all data operations,
and only close acquires the write lock. Slab allocation avoids per-socket
heap allocation; SlabRef provides stable references suitable for the RwLock wrapper,
and matches the return type of SocketOps::accept(). The dispatch layer — not the trait
implementation — acquires the appropriate lock before calling trait methods:
- Data-path operations (sendmsg, recvmsg, poll) acquire a shared (read) lock,
allowing concurrent sends/receives from multiple threads (matching Linux's behavior
where multiple threads can read/write the same socket simultaneously).
- Lifecycle operations (close, shutdown, setsockopt, bind, listen)
acquire an exclusive (write) lock, ensuring they are serialized with respect to
all other operations. Note that connect() acquires the write lock only to transition
the socket state to SYN_SENT, then releases the lock and blocks on a wait queue,
allowing concurrent close() or shutdown() to abort the connection attempt.
The SocketOps trait methods all take &self because the dispatch layer guarantees
the correct lock is held before invocation. Implementations use interior mutability
(per-field atomics or fine-grained locks) for their mutable state, as is standard for
Rust traits shared across threads.
This means:
- close() waits for any in-flight recvmsg() or sendmsg() to complete before
proceeding. No use-after-free is possible. To prevent unbounded waits when
recvmsg() is blocked in a long-polling receive, close() sets a SOCK_DEAD
flag (visible to the socket's wait queue) before acquiring the write lock. This
wakes any blocked readers, which check the flag and return -EBADF, releasing
their read locks. This mirrors Linux's sock_flag(sk, SOCK_DEAD) mechanism.
- shutdown() + recvmsg(): shutdown(SHUT_RD) acquires the exclusive lock, sets a
"read-shutdown" flag, and releases the lock. Subsequent recvmsg() calls see the
flag and return 0 (EOF) without waiting for the exclusive lock.
- The RwLock is a per-CPU reader-optimized lock (the shared-lock fast path is a
single atomic increment on the local CPU's counter, adding < 10 ns to data-path
operations).
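The lock discipline above can be modeled in userspace Rust. This is a sketch, not the kernel code: SlabRef is stood in by Arc, the slot is wrapped in Option so that close can drop the reference under the write lock, and SocketOps is trimmed to two methods:

```rust
use std::sync::{Arc, RwLock};

// Reduced stand-in for the SocketOps trait in this section.
trait SocketOps: Send + Sync {
    fn sendmsg(&self, buf: &[u8]) -> usize;
}

struct Dispatch {
    // The per-socket slot: read-locked on the data path, write-locked by close.
    slot: RwLock<Option<Arc<dyn SocketOps>>>,
}

impl Dispatch {
    /// Data path: shared lock, so concurrent senders do not serialize here.
    fn sendmsg(&self, buf: &[u8]) -> Result<usize, &'static str> {
        let guard = self.slot.read().unwrap();
        match guard.as_ref() {
            Some(sock) => Ok(sock.sendmsg(buf)),
            None => Err("EBADF"), // socket already closed
        }
    }

    /// Lifecycle path: the exclusive lock waits for in-flight data
    /// operations to release their read locks, then drops the reference —
    /// the use-after-free-freedom argument made in the text.
    fn close(&self) {
        let mut guard = self.slot.write().unwrap();
        guard.take(); // drop the Arc; subsequent sendmsg sees None
    }
}

struct Echo;
impl SocketOps for Echo {
    fn sendmsg(&self, buf: &[u8]) -> usize { buf.len() }
}

fn main() {
    let d = Dispatch { slot: RwLock::new(Some(Arc::new(Echo))) };
    assert_eq!(d.sendmsg(b"hello"), Ok(5));
    d.close();
    assert_eq!(d.sendmsg(b"hello"), Err("EBADF"));
}
```

The SOCK_DEAD flag and wait-queue wakeup from the text are omitted here; std's RwLock has no such mechanism, so this sketch only shows the reader/writer split.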
Socket objects returned by accept() are allocated from a per-CPU slab allocator (the
kernel slab allocator described in Section 12), not the general-purpose heap. SlabRef<T>
is a typed reference into a slab pool, providing O(1) allocation and deallocation without
contending on a global heap lock. This is critical for servers handling millions of
concurrent connections — Box<dyn SocketOps> would introduce a heap allocation on every
accept(), creating allocator contention under load. The slab is pre-sized per socket
type during protocol registration and grows in page-granularity chunks on demand.
Static dispatch for the common path: The SocketOps trait uses dynamic dispatch
(dyn SocketOps) to support runtime protocol registration and heterogeneous socket
collections. However, the common case — TCP sockets using the built-in CUBIC congestion
control — is monomorphized at compile time via generic specialization. The TCP
implementation calls its own concrete methods directly on the hot path (connect, send,
recv); dyn dispatch is only exercised when the socket layer must operate on a
protocol-agnostic socket handle (e.g., epoll readiness checks across mixed socket
types, or the close() path that iterates the fd table). This ensures the TCP fast path
has zero vtable overhead.
Protocol registration happens at isle-net initialization:
/// Register a transport protocol with the socket layer.
/// Called during isle-net init for built-in protocols (TCP, UDP, SCTP, MPTCP).
/// Can also be called at runtime to register dynamically loaded protocols.
pub fn register_protocol(
    family: AddressFamily, // AF_INET, AF_INET6
    sock_type: SocketType, // SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET
    protocol: u16,         // IPPROTO_TCP, IPPROTO_UDP, IPPROTO_SCTP, IPPROTO_MPTCP
    factory: Box<dyn SocketFactory>,
) -> Result<(), KernelError>;
Adding a new transport (e.g., QUIC kernel offload) requires only: (1) implement
SocketOps, (2) implement SocketFactory, (3) call register_protocol with the
appropriate family/type/protocol tuple. No changes to the socket layer, syscall
dispatch, or any other transport's code.
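Assuming the registry is simply a map keyed by the (family, type, protocol) tuple, the three-step recipe above can be sketched as follows (QuicSocket, QuicFactory, and the protocol number are hypothetical, and Arc stands in for the kernel's reference type):

```rust
use std::collections::HashMap;
use std::sync::Arc;

type KernelError = &'static str;

// Reduced stand-ins for the traits in this section.
trait SocketOps: Send + Sync {}
trait SocketFactory: Send + Sync {
    fn create_socket(&self) -> Arc<dyn SocketOps>;
}

/// Registry keyed by the (family, sock_type, protocol) tuple.
struct ProtocolRegistry {
    map: HashMap<(u16, u32, u16), Arc<dyn SocketFactory>>,
}

impl ProtocolRegistry {
    fn register_protocol(
        &mut self,
        family: u16,
        sock_type: u32,
        protocol: u16,
        factory: Arc<dyn SocketFactory>,
    ) -> Result<(), KernelError> {
        // First registration wins; duplicates are rejected.
        if self.map.contains_key(&(family, sock_type, protocol)) {
            return Err("EEXIST");
        }
        self.map.insert((family, sock_type, protocol), factory);
        Ok(())
    }

    /// socket() dispatch: look up the factory and create an instance.
    fn create(&self, family: u16, sock_type: u32, protocol: u16)
        -> Result<Arc<dyn SocketOps>, KernelError>
    {
        self.map
            .get(&(family, sock_type, protocol))
            .map(|f| f.create_socket())
            .ok_or("EPROTONOSUPPORT")
    }
}

// A hypothetical new transport: only these two impls need to be written.
struct QuicSocket;
impl SocketOps for QuicSocket {}
struct QuicFactory;
impl SocketFactory for QuicFactory {
    fn create_socket(&self) -> Arc<dyn SocketOps> { Arc::new(QuicSocket) }
}

fn main() {
    let mut reg = ProtocolRegistry { map: HashMap::new() };
    // AF_INET = 2, SOCK_STREAM = 1, with an illustrative protocol number.
    reg.register_protocol(2, 1, 200, Arc::new(QuicFactory)).unwrap();
    assert!(reg.create(2, 1, 200).is_ok());
    assert!(reg.create(2, 1, 6).is_err()); // TCP not registered in this sketch
}
```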
Linux comparison: Linux's struct proto_ops serves a similar role, but the
implementation is entangled with struct sock internals. Adding MPTCP to Linux
required modifying tcp_input.c, tcp_output.c, the socket layer, and the
connection tracking subsystem. ISLE's trait boundary enforces that transports are
self-contained.
34.3 Congestion Control Framework
Congestion control is pluggable via a trait, selectable per-socket at runtime:
/// Pluggable congestion control algorithm.
/// Each algorithm maintains its own state (cwnd, ssthresh, RTT samples).
///
/// All methods take `&self` (not `&mut self`) because the socket lock is NOT
/// held during the full TX/RX processing path — the TX path may read `cwnd()`
/// while the RX path calls `on_ack()`. Implementations use `AtomicU64` for
/// all mutable state (providing `Sync` as required by the trait bound). On
/// x86 TSO, atomic read-modify-write operations compile to locked instructions
/// (e.g., `LOCK XADD`), incurring a small ~15-20 cycle penalty. This is acceptable
/// on the TCP slow path; high-throughput flows rely on batching to amortize this cost.
///
/// All byte counters use `u64` to support high-BDP networks. At 400 Gbps
/// with 100ms RTT, the bandwidth-delay product is ~5 GB, which overflows
/// `u32` (max ~4.3 GB). Using `u64` avoids silent overflow on datacenter
/// and WAN paths.
pub trait CongestionControl: Send + Sync {
    /// Called when ACKs arrive, advancing the send window.
    fn on_ack(&self, acked_bytes: u64, rtt: Duration);
    /// Called when loss is detected (timeout or SACK-based).
    fn on_loss(&self, lost_bytes: u64);
    /// Current congestion window in bytes.
    fn cwnd(&self) -> u64;
    /// Slow-start threshold in bytes.
    fn ssthresh(&self) -> u64;
    /// Human-readable algorithm name (for setsockopt, procfs).
    fn name(&self) -> &'static str;
}
Built-in algorithms:
| Algorithm | Description | Default |
|---|---|---|
| CUBIC | Linux default since 2.6.19. Cubic function for cwnd growth. | Yes |
| BBR | Google's bottleneck bandwidth and RTT-based CC. | No |
| BBRv3 | Revised BBR merging bandwidth and loss models into a single state machine; available from Google's BBR repository (not yet in Linux mainline as of 2026). | No |
| Reno | Classic AIMD (additive increase, multiplicative decrease). | No |
Per-socket selection via setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr") — same
API as Linux. Applications that set congestion control on Linux work identically on
ISLE.
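As an illustration of the &self-plus-atomics discipline described in the trait's doc comment, here is a minimal Reno-style AIMD implementation. The trait is restated (trimmed) so the block is self-contained; the initial-window and MSS constants are illustrative assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

trait CongestionControl: Send + Sync {
    fn on_ack(&self, acked_bytes: u64, rtt: Duration);
    fn on_loss(&self, lost_bytes: u64);
    fn cwnd(&self) -> u64;
    fn ssthresh(&self) -> u64;
    fn name(&self) -> &'static str;
}

const MSS: u64 = 1460; // illustrative maximum segment size

struct Reno {
    cwnd: AtomicU64,     // congestion window, bytes
    ssthresh: AtomicU64, // slow-start threshold, bytes
}

impl CongestionControl for Reno {
    fn on_ack(&self, acked_bytes: u64, _rtt: Duration) {
        let cwnd = self.cwnd.load(Ordering::Relaxed);
        let growth = if cwnd < self.ssthresh.load(Ordering::Relaxed) {
            acked_bytes // slow start: exponential growth
        } else {
            MSS * acked_bytes / cwnd // congestion avoidance: ~1 MSS per RTT
        };
        self.cwnd.fetch_add(growth.max(1), Ordering::Relaxed);
    }
    fn on_loss(&self, _lost_bytes: u64) {
        // Multiplicative decrease: halve cwnd, floor at 2 MSS.
        let new = (self.cwnd.load(Ordering::Relaxed) / 2).max(2 * MSS);
        self.ssthresh.store(new, Ordering::Relaxed);
        self.cwnd.store(new, Ordering::Relaxed);
    }
    fn cwnd(&self) -> u64 { self.cwnd.load(Ordering::Relaxed) }
    fn ssthresh(&self) -> u64 { self.ssthresh.load(Ordering::Relaxed) }
    fn name(&self) -> &'static str { "reno" }
}

fn main() {
    let cc = Reno {
        cwnd: AtomicU64::new(10 * MSS),
        ssthresh: AtomicU64::new(u64::MAX),
    };
    cc.on_ack(MSS, Duration::from_millis(10)); // slow start: +1 MSS
    assert_eq!(cc.cwnd(), 11 * MSS);
    cc.on_loss(MSS); // halve
    assert_eq!(cc.cwnd(), 11 * MSS / 2);
}
```

Note that all mutable state lives in AtomicU64, as the trait's documentation requires; the relaxed orderings are sufficient for this sketch because each counter is independently consistent.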
eBPF struct_ops: Custom congestion control algorithms can be loaded at runtime via
eBPF, using the same struct_ops mechanism as Linux 5.6+. An eBPF program implements
the CongestionControl trait methods as BPF functions, which are JIT-compiled and
attached to the per-socket congestion control slot. This enables production A/B testing
of new algorithms without kernel rebuilds — the same workflow used at Meta and Google
on Linux.
34.4 MPTCP as First-Class Transport
MPTCP (RFC 8684) is designed into isle-net from the start, not retrofitted onto an existing TCP implementation. This avoids the years of integration pain that Linux experienced.
Architecture:
MPTCP Connection
/ | \
Subflow 0 Subflow 1 Subflow 2
(WiFi) (LTE) (Ethernet)
| | |
TCP stack TCP stack TCP stack
(per-subflow congestion control)
Key design decisions:
- Subflow management: A path manager component handles subflow creation and teardown. It monitors available network interfaces and creates subflows when new paths appear (e.g., WiFi connects). Subflow teardown is graceful (DATA_FIN) or abrupt (RST on path failure).
- Packet scheduler: Distributes data segments across subflows. Built-in policies: round-robin, lowest-RTT (send on the subflow with the shortest current RTT estimate), and redundant (duplicate on all subflows for ultra-low-latency). Scheduler is pluggable via a trait, same pattern as congestion control.
- Sequence number separation: Connection-level Data Sequence Numbers (DSN) are independent of per-subflow TCP sequence numbers. This is architecturally baked in — the MPTCP layer maintains a DSN-to-subflow-sequence mapping, and the per-subflow TCP machines operate with their own sequence spaces. In Linux, this separation was retrofitted and required careful locking; in ISLE, the type system enforces the distinction (DataSeqNum vs SubflowSeqNum are distinct newtypes).
- Middlebox fallback: If a middlebox strips MPTCP options from the SYN/ACK, the connection falls back to single-path TCP transparently. The application sees a working connection regardless.
- Use cases requiring MPTCP: iOS and macOS use MPTCP for seamless WiFi/cellular handoff. Multipath TCP proxies improve connection reliability for mobile clients. WireGuard multipath tunnels bond multiple network paths for increased throughput.
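The lowest-RTT scheduling policy can be sketched as a pure selection function over per-subflow state. The field names and the congestion-window check are assumptions for illustration, not ISLE's actual scheduler interface:

```rust
use std::time::Duration;

/// One MPTCP subflow's scheduling-relevant state (illustrative fields).
struct Subflow {
    srtt: Duration, // smoothed RTT estimate
    cwnd: u64,      // congestion window, bytes
    inflight: u64,  // bytes sent but not yet acknowledged on this subflow
}

impl Subflow {
    /// Does this subflow have congestion-window space for `bytes` more?
    fn has_window(&self, bytes: u64) -> bool {
        self.inflight + bytes <= self.cwnd
    }
}

/// Lowest-RTT policy: among subflows with window space for the segment,
/// pick the one with the smallest smoothed RTT. Returns the subflow index,
/// or None if every subflow is congestion-limited (the segment waits).
fn pick_lowest_rtt(subflows: &[Subflow], seg_bytes: u64) -> Option<usize> {
    subflows
        .iter()
        .enumerate()
        .filter(|(_, s)| s.has_window(seg_bytes))
        .min_by_key(|(_, s)| s.srtt)
        .map(|(i, _)| i)
}

fn main() {
    let subflows = vec![
        // WiFi: low RTT but window full, so it is skipped.
        Subflow { srtt: Duration::from_millis(30), cwnd: 64_000, inflight: 63_000 },
        // LTE: high RTT, has space.
        Subflow { srtt: Duration::from_millis(60), cwnd: 64_000, inflight: 10_000 },
        // Ethernet: lowest RTT with space — chosen.
        Subflow { srtt: Duration::from_millis(2), cwnd: 64_000, inflight: 0 },
    ];
    assert_eq!(pick_lowest_rtt(&subflows, 1460), Some(2));
}
```

The round-robin and redundant policies from the list above would be alternative implementations of the same pluggable-scheduler trait.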
34.5 Domain Switch Overhead Analysis
Clarification: isle-core mediates domain transitions (switching PKRU/POR_EL0 state) but does NOT copy packet data. The NIC driver and isle-net share a zero-copy ring buffer in shared memory (accessible to both domains via PKEY 1). Domain switches occur when the CPU transitions between executing isle-core code (for dispatch/scheduling), NIC driver code (for DMA completion processing), and isle-net code (for TCP/IP processing). The 4 switches represent: (1) isle-core→NIC driver for interrupt dispatch, (2) NIC driver→isle-core on return, (3) isle-core→isle-net for protocol processing, (4) isle-net→isle-core on return. Data flows through shared-memory ring buffers without additional copies.
The network stack (isle-net, Tier 1) runs in its own isolation domain. Every packet traverses two domain boundaries, each requiring an entry and exit switch (4 switches total):
- isle-core to NIC driver and back: domain switch to enter NIC driver domain for interrupt handling (~23 cycles), then domain switch to return to isle-core (~23 cycles)
- isle-core to isle-net and back: domain switch to enter isle-net domain for TCP processing (~23 cycles), then domain switch to return to isle-core for socket delivery (~23 cycles)
For high-throughput networking (100 Gbps), the overhead matters.
Per-packet cost analysis (1500-byte frames at 100 Gbps = ~8.3M packets/sec):
Domain switches per packet: 4 (2 domain entries x 2 switches each)
Cycles per switch: ~23 (WRPKRU, per Section 3)
Total domain switch overhead/packet: ~92 cycles (~20ns at 4.5 GHz)
Time budget per packet: ~120ns (at 8.3M pps)
Domain switch overhead fraction: ~17% (at 4.5 GHz) to ~26% (at 3 GHz)
This 17-26% overhead (depending on clock speed) is unacceptable for production networking. Four mitigations reduce it to a negligible fraction:
- Batching: Process packets in batches of up to 64. Each batch requires only 4 domain switches total (not 4 per packet), because the receiving domain processes the entire batch before returning. This reduces per-packet domain switch overhead to ~1.4 cycles at batch size 64 (92 cycles / 64 packets).
- NAPI-style polling: After the first interrupt, switch to polling mode. The NIC driver translates its hardware-specific completion events into standardized KABI completion descriptors. These KABI descriptors are written to a shared isolation domain (the shared read-only PKEY per Section 3's domain allocation table), accessible by both isle-net and the NIC driver. isle-net reads the KABI descriptors directly from the shared domain without a per-packet domain switch. Write access to ring doorbell registers remains in the NIC driver's private domain — isle-net can observe completions but cannot manipulate the hardware directly. This deliberately places the ring descriptors outside the NIC driver's private domain, following the standard ISLE pattern for zero-copy data exchange between domains. No per-packet interrupt or domain switch while in polling mode. The polling-to-interrupt transition uses an adaptive threshold based on packet rate.
- XDP fast path: XDP programs run in the NIC driver's isolation domain, processing packets before they reach isle-net. Packets that are dropped, redirected, or TX-bounced by XDP never incur the driver-to-isle-net domain switch. For workloads like DDoS mitigation where >90% of packets are dropped, this eliminates nearly all domain switches.
- GRO (Generic Receive Offload): Coalesce multiple small packets into larger aggregates before delivery across domain boundaries, amortizing the per-packet domain switch and processing cost across multiple original packets.
Measured overhead target: With batching + NAPI polling active, the domain switch overhead for sustained 100 Gbps throughput is <2% of CPU time. This is comparable to Linux's combined interrupt + softirq overhead for the same workload, making domain isolation effectively free at high packet rates.
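The arithmetic behind the figures in this subsection can be reproduced in a few lines (the helper function is purely illustrative; small rounding differences against the quoted percentages come from the approximation in the ~23-cycle figure):

```rust
/// Fraction of the per-packet time budget consumed by domain switches.
fn overhead_fraction(
    switch_cycles: f64, // cycles per domain switch (~23)
    switches: f64,      // switches per batch (4)
    batch: f64,         // packets per batch (1 = unbatched)
    clock_ghz: f64,     // CPU clock
    pps_millions: f64,  // packet rate in millions/sec
) -> f64 {
    let overhead_ns = switch_cycles * switches / batch / clock_ghz;
    let budget_ns = 1000.0 / pps_millions; // ns available per packet
    overhead_ns / budget_ns
}

fn main() {
    // Unbatched: 4 x ~23 cycles per packet at 8.3M pps (1500B @ 100 Gbps).
    let at_45 = overhead_fraction(23.0, 4.0, 1.0, 4.5, 8.3);
    let at_30 = overhead_fraction(23.0, 4.0, 1.0, 3.0, 8.3);
    println!("unbatched: {:.0}% @4.5GHz, {:.0}% @3GHz", at_45 * 100.0, at_30 * 100.0);

    // Batch of 64: the same 4 switches amortized across 64 packets.
    let batched = overhead_fraction(23.0, 4.0, 64.0, 4.5, 8.3);
    println!("batched(64): {:.2}%", batched * 100.0);
    assert!(batched < 0.005); // well under the <2% target
}
```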
35. Network Overlay and Tunneling
Linux problem: Overlay networking (VXLAN, Geneve) was bolted onto the stack over many years. Bridge/veth code is complex and poorly isolated — a bug in the bridge module can crash the kernel.
ISLE design:
Tunnel protocols as isle-net modules — Each tunnel type runs as a Tier 1 module
and implements a TunnelDevice trait:
/// Tunnel device interface for encapsulation protocols.
pub trait TunnelDevice: NetDevice {
/// Encapsulate an inner packet for transmission through the tunnel.
fn encap(&self, inner: &Packet, metadata: &TunnelMetadata) -> Result<Packet>;
/// Decapsulate a received packet, returning the inner packet.
fn decap(&self, outer: &Packet) -> Result<(Packet, TunnelMetadata)>;
/// Maximum overhead added by encapsulation (for MTU calculation).
fn encap_overhead(&self) -> usize;
/// Tunnel-specific metadata (VNI, flow label, key, etc.).
type Metadata: TunnelMetadata;
}
Supported tunnel protocols:
| Protocol | Description | Use case |
|---|---|---|
| VXLAN | Virtual Extensible LAN (UDP port 4789) | Cloud overlay, OpenStack |
| Geneve | Generic Network Virtualization Encap | OVN, next-gen cloud overlay |
| GRE/GRE6 | Generic Routing Encapsulation | Site-to-site tunnels |
| IPIP/SIT | IP-in-IP and IPv6-in-IPv4 | IPv6 transition |
| WireGuard | Modern VPN (ChaCha20-Poly1305) | Secure tunnels |
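For the MTU calculation that encap_overhead() feeds, the standard header sizes give the usual figures — e.g., VXLAN over IPv4 adds 50 bytes (encapsulated inner Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8), so an outer MTU of 1500 leaves 1450 bytes for the inner interface. A sketch under those standard sizes (the helper names are illustrative):

```rust
// Standard header sizes in bytes.
const ETH_HDR: usize = 14;  // inner Ethernet frame carried inside the tunnel
const IPV4_HDR: usize = 20; // outer IPv4 header (no options)
const UDP_HDR: usize = 8;

/// VXLAN over IPv4: inner Ethernet + outer IPv4 + UDP + 8-byte VXLAN header.
fn vxlan_overhead_v4() -> usize {
    ETH_HDR + IPV4_HDR + UDP_HDR + 8
}

/// Geneve over IPv4: same base as VXLAN plus variable-length options.
fn geneve_overhead_v4(opt_len: usize) -> usize {
    ETH_HDR + IPV4_HDR + UDP_HDR + 8 + opt_len
}

/// GRE over IPv4: outer IPv4 + 4-byte base GRE header (no key/sequence).
fn gre_overhead_v4() -> usize {
    IPV4_HDR + 4
}

fn main() {
    assert_eq!(vxlan_overhead_v4(), 50);
    // MTU the inner interface should advertise when the outer path MTU is 1500:
    assert_eq!(1500 - vxlan_overhead_v4(), 1450);
    assert_eq!(gre_overhead_v4(), 24);
}
```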
Software L2 switch — A Linux bridge equivalent in isle-net, supporting:
- STP (Spanning Tree Protocol) for loop prevention
- VLAN filtering (802.1Q tag-aware forwarding)
- FDB (Forwarding Database) learning with configurable aging
- Per-port traffic shaping
Virtual device pairs:
- veth: Virtual ethernet pairs for namespace connectivity. Required for Docker and Kubernetes pod networking. Each end of the pair lives in a different network namespace.
- macvlan/ipvlan: Lightweight container networking without bridges. macvlan assigns unique MACs per container; ipvlan shares the parent MAC and routes by IP.
VRF (Virtual Routing and Forwarding) — L3 domain isolation for multi-tenant routing. Each VRF has its own routing table and forwarding decisions, enabling multiple tenants to use overlapping IP ranges on the same host.
Hardware offload — Tunnel encap/decap can be offloaded to NIC hardware via KABI.
This is the equivalent of Linux TC flower offload:
- NIC firmware handles VXLAN/Geneve encap/decap in hardware
- isle-net falls back to software path transparently if NIC lacks offload support
- Offload rules are programmed via the same TunnelDevice trait (the NIC driver
implements the trait with hardware acceleration)
XDP integration — XDP programs can inspect inner headers of tunneled packets via
a "decap-before-XDP" mode:
- Because XDP runs in the NIC driver before reaching isle-net, XDP programs that need
to see inner headers must explicitly call a BPF helper (e.g., bpf_xdp_decap()).
- This helper invokes the NIC's hardware offload or a fast-path software decapsulator
to strip the tunnel headers.
- The XDP program then sees the inner (original) packet headers.
- Allows filtering/load-balancing decisions based on inner flow information
- This avoids the Linux problem where XDP programs must manually parse tunnel headers
Container networking compatibility — Docker bridge network mode and Kubernetes
CNI plugins (Calico, Cilium, Flannel) must work without modification. This requires:
- veth pair creation via netlink
- Bridge port management via netlink
- VXLAN device creation via netlink
- iptables/nftables rules for masquerade and port mapping
- All of these are covered by the netlink subsystem (Section 35.1) and BPF-based
packet filtering (Section 35.2)
35.1 Netlink Socket Interface
Netlink is the primary kernel-userspace IPC mechanism for network configuration.
Docker, Kubernetes CNI plugins (Calico, Cilium, Flannel), iproute2 (ip route,
ip addr, ip link), and NetworkManager all depend on netlink. ISLE implements
netlink as a socket family within isle-net.
Socket family: AF_NETLINK sockets are created via socket(AF_NETLINK, SOCK_DGRAM,
protocol). Each protocol family controls a different subsystem:
| Protocol | Purpose | Capability required |
|---|---|---|
| NETLINK_ROUTE | Routes, addresses, links, neighbors, rules | CAP_NET_ADMIN for writes; reads are unprivileged |
| NETLINK_AUDIT | Audit event delivery (see Section 40.9) | CAP_AUDIT_READ |
| NETLINK_KOBJECT_UEVENT | Device hotplug events (udev) | Unprivileged (receive only) |
| NETLINK_GENERIC | Generic extensible netlink (genetlink) | Per-family capability check |
Message format: Every netlink message starts with an nlmsghdr (16 bytes):
/// Netlink message header (matches Linux struct nlmsghdr exactly).
#[repr(C)]
pub struct NlMsgHdr {
    /// Total message length including header.
    pub nlmsg_len: u32,
    /// Message type (RTM_NEWROUTE, RTM_DELADDR, etc.).
    pub nlmsg_type: u16,
    /// Flags (NLM_F_REQUEST, NLM_F_DUMP, NLM_F_ACK, etc.).
    pub nlmsg_flags: u16,
    /// Sequence number (for request/response matching).
    pub nlmsg_seq: u32,
    /// Sending process port ID (0 = kernel).
    pub nlmsg_pid: u32,
}
Messages are followed by type-specific payload structs (ifinfomsg for links,
ifaddrmsg for addresses, rtmsg for routes) and nested TLV attributes (rtattr).
isle-net implements the full NETLINK_ROUTE message set required for container
networking:
- Link management: RTM_NEWLINK, RTM_DELLINK, RTM_GETLINK — create/destroy/query veth pairs, bridges, VXLAN devices, macvlan/ipvlan
- Address management: RTM_NEWADDR, RTM_DELADDR, RTM_GETADDR — assign/remove IPv4/IPv6 addresses
- Route management: RTM_NEWROUTE, RTM_DELROUTE, RTM_GETROUTE — manipulate routing tables (including per-VRF tables)
- Neighbor management: RTM_NEWNEIGH, RTM_DELNEIGH, RTM_GETNEIGH — ARP/NDP neighbor table entries
- Rule management: RTM_NEWRULE, RTM_DELRULE — policy routing rules
Capability gating: Netlink write operations require the appropriate capability in the caller's network namespace. Read operations and multicast group subscriptions are unprivileged, matching Linux semantics. This ensures that unprivileged containers can observe network state but cannot modify it without explicit capability grants.
Multicast groups: Processes subscribe to multicast groups (e.g., RTNLGRP_LINK,
RTNLGRP_IPV4_ROUTE) to receive asynchronous notifications of network state changes.
This is how ip monitor, container runtimes, and NetworkManager track link state.
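As a concrete example of the wire format, a minimal RTM_GETLINK dump request (the message an ip link listing sends) can be serialized by hand. The constants below are the standard Linux uapi values; the builder itself is an illustrative sketch, and the byte layout follows the NlMsgHdr struct above (host-endian, as netlink requires):

```rust
const NLMSG_HDRLEN: u32 = 16;   // sizeof(struct nlmsghdr)
const RTM_GETLINK: u16 = 18;
const NLM_F_REQUEST: u16 = 0x0001;
const NLM_F_DUMP: u16 = 0x0300; // NLM_F_ROOT | NLM_F_MATCH

/// Round up to the 4-byte netlink boundary (NLMSG_ALIGN).
fn nlmsg_align(len: u32) -> u32 {
    (len + 3) & !3
}

/// Serialize an nlmsghdr followed by `payload`, padded to alignment.
fn build_getlink_request(seq: u32, payload: &[u8]) -> Vec<u8> {
    let len = NLMSG_HDRLEN + payload.len() as u32;
    let mut msg = Vec::with_capacity(nlmsg_align(len) as usize);
    msg.extend_from_slice(&len.to_ne_bytes());                          // nlmsg_len
    msg.extend_from_slice(&RTM_GETLINK.to_ne_bytes());                  // nlmsg_type
    msg.extend_from_slice(&(NLM_F_REQUEST | NLM_F_DUMP).to_ne_bytes()); // nlmsg_flags
    msg.extend_from_slice(&seq.to_ne_bytes());                          // nlmsg_seq
    msg.extend_from_slice(&0u32.to_ne_bytes());                         // nlmsg_pid (0 = kernel)
    msg.extend_from_slice(payload);
    msg.resize(nlmsg_align(len) as usize, 0);                           // pad
    msg
}

fn main() {
    // Payload: struct ifinfomsg is 16 bytes; an all-zero one matches any link.
    let msg = build_getlink_request(1, &[0u8; 16]);
    assert_eq!(msg.len(), 32); // 16-byte header + 16-byte ifinfomsg
    assert_eq!(u32::from_ne_bytes(msg[0..4].try_into().unwrap()), 32);
}
```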
35.2 Packet Filtering (BPF-Based)
ISLE does not implement a separate nftables or iptables subsystem. Packet filtering uses the BPF-based filtering infrastructure described in Section 20.4 (eBPF).
Architecture: All packet filtering hooks (prerouting, input, forward, output, postrouting) are BPF attachment points. BPF programs attached to these hooks perform the equivalent of iptables/nftables rules: matching on headers, NATing, dropping, marking, and logging.
nftables/iptables compatibility: The syscall interface (Section 20)
translates legacy iptables and nftables rule manipulations (setsockopt for iptables,
netlink NFT_MSG_* for nftables) into equivalent BPF programs that are compiled and
attached to the appropriate hooks. This translation happens transparently:
- iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -j MASQUERADE is translated to a BPF program attached to the postrouting hook that performs source NAT
- nft add rule ip filter input tcp dport 80 accept is translated to a BPF program attached to the input hook
This approach provides Docker/Kubernetes compatibility (which depend on iptables/nftables for port mapping, masquerade, and network policy) without maintaining a separate packet filtering subsystem. The BPF JIT ensures that translated rules execute at native speed.
Connection tracking (conntrack): Stateful NAT (MASQUERADE, DNAT, SNAT) requires tracking connection state to map return packets back to the original source. ISLE implements connection tracking as a BPF-accessible hash map maintained by isle-net:
/// 5-tuple identifying one direction of a connection.
/// For ICMP, `src_port` and `dst_port` are repurposed as type and code.
pub struct ConntrackTuple {
    /// Source IP address (IPv4-mapped-IPv6 for v4, native for v6).
    pub src_addr: [u8; 16], // 16 bytes raw array to guarantee layout
    /// Destination IP address.
    pub dst_addr: [u8; 16], // 16 bytes
    /// Source port (or ICMP type).
    pub src_port: u16,
    /// Destination port (or ICMP code).
    pub dst_port: u16,
    /// IP protocol number (TCP=6, UDP=17, ICMP=1, ICMPv6=58, etc.).
    pub protocol: u8,
}
/// Connection tracking entry.
/// Keyed by (protocol, src_ip, src_port, dst_ip, dst_port) 5-tuple.
pub struct ConntrackEntry {
    /// Original direction 5-tuple.
    pub original: ConntrackTuple,
    /// Reply direction 5-tuple (after NAT translation).
    pub reply: ConntrackTuple,
    /// Connection state (NEW, ESTABLISHED, RELATED, INVALID, UNTRACKED).
    pub state: ConntrackState,
    /// NAT type applied (SNAT, DNAT, MASQUERADE, or None).
    pub nat_type: NatType,
    /// Conntrack zone (u16, matches Linux `nf_conntrack` zones). Zones allow
    /// overlapping IP ranges in different network namespaces to coexist in the
    /// same conntrack table without collisions. The zone is part of the hash key.
    pub zone: u16,
    /// Connection mark (set by iptables CONNMARK target). Used by Kubernetes
    /// kube-proxy for service routing and by iptables CONNMARK save/restore.
    pub mark: u32,
    /// Timeout (nanoseconds since boot). Entry is garbage-collected after expiry.
    pub timeout_ns: u64,
    /// Packet/byte counters (for accounting).
    pub packets_original: u64,
    pub packets_reply: u64,
    pub bytes_original: u64,
    pub bytes_reply: u64,
}
The conntrack table is a concurrent hash map with per-bucket spinlocks and
RCU-protected lookup (matching Linux's nf_conntrack design: a global hash table
with per-bucket locking, not per-CPU sharding, because connection state must be
visible across all CPUs for NAT reply-direction lookups). BPF programs at the prerouting and postrouting hooks query and
update conntrack entries via BPF helper functions (bpf_ct_lookup(), bpf_ct_insert(),
bpf_ct_set_nat()). This integrates with the BPF-based packet filtering: a MASQUERADE
BPF program creates a conntrack entry with SNAT on the outgoing path; the prerouting
hook automatically reverses the NAT for return packets by looking up the conntrack entry.
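The NAT reversal described above hinges on storing the reply-direction tuple at SNAT time: after masquerading, return packets arrive addressed to the NAT address, and the reply tuple is what the prerouting lookup matches. A minimal sketch of that tuple construction (standalone simplified types; the field layout follows ConntrackTuple above, but the helper name is hypothetical):

```rust
/// Simplified stand-in for ConntrackTuple (same field layout as above).
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Tuple {
    pub src_addr: [u8; 16],
    pub dst_addr: [u8; 16],
    pub src_port: u16,
    pub dst_port: u16,
    pub protocol: u8,
}

/// Build the reply-direction tuple for an SNAT'd/masqueraded connection.
/// Outgoing packets leave as (nat_addr, dst); replies therefore arrive as
/// (src = original dst, dst = nat_addr), which is what we store and match.
pub fn snat_reply_tuple(original: &Tuple, nat_addr: [u8; 16], nat_port: u16) -> Tuple {
    Tuple {
        src_addr: original.dst_addr,
        src_port: original.dst_port,
        dst_addr: nat_addr,
        dst_port: nat_port,
        protocol: original.protocol,
    }
}
```

Matching a return packet against this stored reply tuple is what lets the prerouting hook rewrite the destination back to the original source without any per-packet rule evaluation.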
BPF isolation domain interaction: BPF programs execute in their own isolation domain
(Section 20.4), but conntrack state lives in isle-net's domain. To avoid the ~92-cycle
domain switch overhead per packet, the conntrack hash table is mapped read-only into
the BPF isolation domain. The bpf_ct_lookup() helper executes directly within the BPF
domain without a domain switch, performing a safe read-only lookup. Operations that
mutate conntrack state (bpf_ct_insert(), bpf_ct_set_nat()) are less frequent and
invoke a kernel-mediated helper that dispatches to isle-net via a domain switch.
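This read-local/write-remote split can be sketched as follows, with plain Rust containers standing in for the read-only mapped hash table and the domain ring buffer (all type, field, and variant names are hypothetical):

```rust
/// Simplified conntrack mutation command sent to isle-net over the domain ring.
#[derive(Debug, PartialEq)]
pub enum CtCommand {
    Insert { key: u64 },
    SetNat { key: u64 },
}

/// The BPF domain's view of conntrack state: lookups read the mapped table
/// directly (no domain switch); mutations queue a command for isle-net.
pub struct BpfCtView<'a> {
    /// Stand-in for the read-only mapped hash table: (flow key, state) pairs.
    pub table: &'a [(u64, u32)],
    /// Stand-in for the domain ring buffer carrying mutations to isle-net.
    pub ring: Vec<CtCommand>,
}

impl<'a> BpfCtView<'a> {
    /// bpf_ct_lookup(): pure read-only lookup within the BPF domain.
    pub fn lookup(&self, key: u64) -> Option<u32> {
        self.table.iter().find(|(k, _)| *k == key).map(|(_, s)| *s)
    }

    /// bpf_ct_insert(): a mutation, dispatched to isle-net via the ring
    /// (in the real design this is where the ~92-cycle switch is paid).
    pub fn insert(&mut self, key: u64) {
        self.ring.push(CtCommand::Insert { key });
    }
}
```

The design choice is the usual read-mostly optimization: lookups dominate the per-packet path, so only the rare mutations pay the domain-switch cost.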
BPF helper isolation model: The general isolation rules for BPF programs in the networking stack (and all other subsystems) are:

1. Domain confinement: Each BPF program executes in a dedicated BPF isolation domain (Section 20.4), separate from both isle-core and the driver or subsystem that loaded it. An XDP program attached to a NIC driver does not run in the driver's domain — it runs in its own BPF domain and accesses driver or subsystem state only through verified BPF helpers, which perform cross-domain access on the program's behalf. This means a verifier bug in a BPF program cannot compromise the NIC driver's memory or isle-net's internal state. The map access control (rule 2) and capability-gated helpers (rule 3) are enforced by this domain boundary, not solely by the verifier's static analysis.
2. Map access control: BPF maps are owned by the isolation domain that created them. A BPF program can only access maps owned by its own domain. Cross-domain map sharing is explicit: the owning domain grants a capability (with MAP_READ, MAP_WRITE, or both permission bits) to the target domain via the standard capability delegation mechanism (Section 11.1). The verifier rejects programs that reference map file descriptors for which the loading domain does not hold a valid capability.
3. Capability-gated helpers: BPF helpers that access kernel state beyond the program's own domain require the BPF domain to hold the corresponding capability. For example: bpf_sk_lookup() (socket table lookup) requires CAP_NET_LOOKUP; bpf_fib_lookup() (route table lookup) requires CAP_NET_ROUTE_READ; bpf_ct_lookup()/bpf_ct_insert() require CAP_NET_CONNTRACK. Enforcement is dual: the verifier rejects programs at load time if the BPF domain does not hold the required capabilities (see rule 5), and the eBPF runtime re-checks the domain's capability set at helper invocation time. The runtime check is necessary because capabilities can be revoked after a program is loaded (Section 11.1) — without it, a revoked capability would remain effective until the program is explicitly unloaded.
4. Cross-domain packet redirect: XDP redirect actions (XDP_REDIRECT, bpf_redirect_map()) that forward a packet to an interface in a different driver's isolation domain require the source domain to hold CAP_NET_REDIRECT for the target interface. Without this capability, the redirect returns -EACCES and the packet is dropped. This prevents a compromised NIC driver from injecting traffic into another driver's domain.
5. Verifier enforcement: The verifier enforces constraints (2)–(4) at program load time by checking the BPF domain's capability set against the program's map references and helper calls. Programs that reference inaccessible maps or call helpers requiring capabilities the domain does not hold are rejected before JIT compilation. This is a static gate: it prevents unauthorized programs from being loaded in the first place. Runtime capability checks at helper invocation time (rule 3) serve a distinct purpose: they enforce capability revocation for already-loaded programs. Both mechanisms are primary for their respective concerns — load-time verification prevents unauthorized loading, and runtime checks ensure revocation takes immediate effect.
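The load-time gate of rule 5 can be sketched as a capability check over the helpers a program calls. Capability and helper names are taken from rule 3; the function shapes are illustrative, not the actual verifier API:

```rust
use std::collections::HashSet;

/// Hypothetical capability identifiers from rule 3.
#[derive(Hash, Eq, PartialEq, Clone, Copy, Debug)]
pub enum Cap {
    NetLookup,
    NetRouteRead,
    NetConntrack,
    NetRedirect,
}

/// Map a helper to the capability it requires (subset shown; helpers with
/// no entry touch only the program's own domain and need no capability).
fn required_cap(helper: &str) -> Option<Cap> {
    match helper {
        "bpf_sk_lookup" => Some(Cap::NetLookup),
        "bpf_fib_lookup" => Some(Cap::NetRouteRead),
        "bpf_ct_lookup" | "bpf_ct_insert" => Some(Cap::NetConntrack),
        "bpf_redirect_map" => Some(Cap::NetRedirect),
        _ => None,
    }
}

/// Load-time gate (rule 5): reject the program before JIT compilation if any
/// called helper requires a capability the loading domain does not hold.
pub fn verify_helper_caps(helpers: &[&str], domain_caps: &HashSet<Cap>) -> Result<(), String> {
    for h in helpers {
        if let Some(cap) = required_cap(h) {
            if !domain_caps.contains(&cap) {
                return Err(format!("{h} requires missing capability {cap:?}"));
            }
        }
    }
    Ok(())
}
```

Note this sketch covers only the static gate; the runtime re-check of rule 3 would consult the domain's live capability set at each helper invocation so that revocation takes immediate effect.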
Linux compatibility: The conntrack subsystem exposes /proc/net/nf_conntrack and
the NETLINK_NETFILTER netlink family for userspace tools (conntrack -L,
conntrack -D). Docker and Kubernetes depend on conntrack for NAT state visibility.
Advantages over separate subsystems: A single filtering mechanism (BPF) eliminates the complexity of maintaining iptables, ip6tables, ebtables, arptables, and nftables as separate subsystems — a major source of bugs and inconsistencies in Linux networking. Connection tracking is the sole stateful component, shared by all BPF-translated NAT rules regardless of their original iptables/nftables syntax.
36. Network Interface Naming
Linux problem: Network interface naming was chaotic (eth0 could be different NICs each boot). systemd's "predictable names" (enp0s3, etc.) partially fixed this but introduced confusing names and edge cases.
ISLE design:
- Deterministic, stable device naming in sysfs based on physical topology (bus/slot/function) from the first boot.
- The device manager assigns stable names based on firmware (ACPI, Device Tree) hints first, then physical topology, then driver enumeration order as a last resort.
- User-defined naming rules via a declarative config (similar to udev rules but simpler).
- Network namespaces get their own independent naming scope (see Section 63 for network namespace architecture).
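The fallback chain (firmware hint, then physical topology, then enumeration order) can be sketched as a simple priority function. The name formats below are illustrative only, not ISLE's actual naming scheme:

```rust
/// Inputs the device manager would gather for one NIC (hypothetical type).
pub struct NicInfo {
    /// Firmware-provided label (ACPI _DSD / Device Tree alias), if any.
    pub firmware_label: Option<String>,
    /// Physical topology as (bus, slot, function), if discoverable.
    pub topology: Option<(u8, u8, u8)>,
    /// Driver enumeration order — the last-resort fallback.
    pub enum_index: u32,
}

/// Pick a stable name by priority: firmware hint > topology > enumeration.
pub fn stable_name(nic: &NicInfo) -> String {
    if let Some(label) = &nic.firmware_label {
        return format!("en-{label}");
    }
    if let Some((bus, slot, func)) = nic.topology {
        return format!("enp{bus}s{slot}f{func}");
    }
    format!("eth{}", nic.enum_index)
}
```

Because the first two sources are derived from hardware identity rather than probe order, the same NIC yields the same name on every boot; only the enumeration-order fallback reintroduces the classic eth0 instability.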
36.1 AF_UNIX Socket Specification
Unix domain sockets provide local inter-process communication with semantics that differ from network sockets. ISLE implements the full Linux AF_UNIX interface for compatibility with systemd, D-Bus, X11/Wayland, and container runtimes.
Socket types:
| Type | Semantics | Use case |
|---|---|---|
| SOCK_STREAM | Byte stream, in-order, reliable | D-Bus, systemd socket activation |
| SOCK_DGRAM | Datagram, unordered, unreliable | Logging, low-overhead IPC |
| SOCK_SEQPACKET | Message-preserving stream, in-order, reliable | Protocol-framed IPC (e.g., varlink) |
Address format:
/// Unix domain socket address (matches Linux struct sockaddr_un).
/// Path sockets start with a non-NUL byte; abstract sockets start with NUL.
#[repr(C)]
pub struct SockAddrUnix {
    /// Address family (AF_UNIX = 1).
    pub sun_family: u16,
    /// Path name or abstract name.
    /// - Path socket: null-terminated filesystem path (max 107 path bytes plus the NUL terminator)
    /// - Abstract socket: sun_path[0] = '\0', followed by abstract name (no filesystem entry)
    /// The Linux limit is 108 bytes total (sizeof(sockaddr_un) - 2 for sun_family).
    pub sun_path: [u8; 108],
}
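A sketch of how a receiver might distinguish the address forms encoded in sun_path, following the layout above (the enum and helper are illustrative, not part of the ISLE API; `path_len` is the number of meaningful sun_path bytes derived from the caller-supplied address length):

```rust
/// The three AF_UNIX address forms (illustrative type).
#[derive(Debug, PartialEq)]
pub enum UnixAddrKind {
    /// No path bytes at all (unnamed socket / autobind trigger).
    Unnamed,
    /// Leading NUL byte: abstract name, no filesystem entry.
    Abstract(Vec<u8>),
    /// Non-NUL first byte: NUL-terminated filesystem path.
    Path(String),
}

pub fn classify(sun_path: &[u8], path_len: usize) -> UnixAddrKind {
    if path_len == 0 {
        UnixAddrKind::Unnamed
    } else if sun_path[0] == 0 {
        // Abstract: the name is everything after the leading NUL.
        UnixAddrKind::Abstract(sun_path[1..path_len].to_vec())
    } else {
        // Path: stop at the first NUL terminator (or at path_len if absent).
        let end = sun_path[..path_len]
            .iter()
            .position(|&b| b == 0)
            .unwrap_or(path_len);
        UnixAddrKind::Path(String::from_utf8_lossy(&sun_path[..end]).into_owned())
    }
}
```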
Abstract namespace: Names starting with \0 (e.g., \0com.example.app) exist independently of the filesystem. Abstract sockets are destroyed when the last reference closes and are not affected by filesystem operations (unlink, rename). They are scoped to the network namespace (Section 63), providing isolation between containers.
Control messages (SCM_RIGHTS, SCM_CREDENTIALS):
/// Ancillary data types for AF_UNIX sockets.
pub enum UnixControlMsg {
    /// Pass file descriptors to the receiver.
    /// The sender's fd table entries are duplicated into the receiver's fd table.
    /// The sender may safely close its fds after sendmsg() returns (they are
    /// duplicated into the receiver, not moved out of the sender).
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_RIGHTS,
    /// cmsg_data = [i32; N] (array of fds)
    ScmRights {
        /// File descriptors to duplicate (max 253 per message, matching Linux SCM_MAX_FD).
        fds: [i32; 253],
        /// Number of valid entries in fds.
        count: usize,
    },
    /// Send sender's credentials to the receiver.
    /// Supported on all AF_UNIX socket types; delivery requires the receiver
    /// to enable SO_PASSCRED (per Linux unix(7)).
    /// The receiver can validate the sender's identity.
    /// Receives as: cmsg_level = SOL_SOCKET, cmsg_type = SCM_CREDENTIALS,
    /// cmsg_data = struct ucred
    ScmCredentials {
        /// Sender's PID in the receiver's PID namespace (translated if different).
        pid: i32,
        /// Sender's UID in the receiver's user namespace.
        uid: u32,
        /// Sender's GID in the receiver's user namespace.
        gid: u32,
    },
}
/// Credential structure for SCM_CREDENTIALS (matches Linux struct ucred).
#[repr(C)]
pub struct UCred {
    pub pid: i32,
    pub uid: u32,
    pub gid: u32,
}
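Passing fds with SCM_RIGHTS requires sizing the ancillary buffer correctly. A sketch of the CMSG space arithmetic, assuming a 64-bit target with 8-byte cmsghdr alignment (this mirrors the CMSG_ALIGN/CMSG_SPACE macros from cmsg(3); the constant is derived from that layout, not from ISLE headers):

```rust
/// struct cmsghdr on a 64-bit target: u64 cmsg_len + i32 cmsg_level
/// + i32 cmsg_type = 16 bytes.
const CMSG_HDR: usize = 16;

/// Round up to the 8-byte cmsg alignment boundary.
fn align8(n: usize) -> usize {
    (n + 7) & !7
}

/// Total ancillary-buffer bytes needed to pass `nfds` file descriptors
/// (header, then the fd array, each padded to the alignment boundary).
pub fn scm_rights_space(nfds: usize) -> usize {
    align8(CMSG_HDR) + align8(nfds * std::mem::size_of::<i32>())
}
```

For example, one fd needs 16 header bytes plus 4 payload bytes padded to 8, i.e. 24 bytes — which is also why a cap like SCM_MAX_FD exists: the ancillary buffer grows linearly with the fd count and must fit in the kernel's per-message ancillary limit.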
SO_PEERCRED: The getsockopt(SOL_SOCKET, SO_PEERCRED, ...) call retrieves the credentials of the peer process at connect() time. This is the standard authentication mechanism for D-Bus and systemd. The credentials are snapshotted when the connection is established and do not change if the peer later calls setuid() or exits.
Socketpair: socketpair(AF_UNIX, type, 0, sv) creates a connected pair of unnamed sockets. Both ends are interchangeable (no client/server distinction). Used for pthreads IPC, async I/O notification pipes, and subprocess communication.
Autobind: Binding to an empty address (address length 2, i.e. only sun_family with no sun_path bytes) triggers autobind, which assigns a unique abstract name (on Linux, a NUL byte followed by five hexadecimal digits). This is used for unnamed socket peers that need a bindable address for getsockname().