Create a Lightweight Process Supervisor in Rust to Protect Critical Services from 'Process Roulette'
Tags: rust · systems · reliability


Unknown
2026-03-10

Build a small Rust process supervisor that restarts critical apps, logs structured telemetry, and defends against accidental kills.

Stop "Process Roulette" — build a lightweight Rust supervisor that keeps critical services alive

You’ve seen it — a stray kill command, an overzealous cleanup utility, or a prank program terminating processes at random and taking your desktop or a critical service down with it. In 2026, teams need resilient, low-friction tools that protect important processes without the heavy-handed complexity of systemd units or sprawling orchestration stacks. This guide shows you how to build a dependable, lightweight process supervisor in Rust that watches, restarts, and logs telemetry for critical processes — reducing downtime from accidental process-killing events often called “process roulette.”

The problem in one line

Accidental SIGKILLs and tools that kill processes at random are unavoidable; SIGKILL cannot be caught or handled. But you can minimize the damage by supervising important processes, restarting them quickly, and exporting telemetry so you can act before users notice.

Why Rust for a process supervisor in 2026?

  • Performance & low overhead: a small binary with deterministic memory behavior is ideal for desktop and edge use.
  • Safety: Rust reduces crashes in the supervisor itself — critical when the supervisor is the last line of defense.
  • Interoperability: Rust's ecosystem (tokio, serde, nix) makes it easy to write async monitoring, structured telemetry, and cross-platform support.
  • Trends in 2025–2026: Organizations favor minimal, observable agents at the edge. eBPF-based observability and Rust-based agents have gained traction for their low overhead and safety guarantees.

What this supervisor will do (requirements)

  1. Launch and monitor a configured process (or command).
  2. Restart it automatically on exit, with exponential backoff to avoid rapid crash loops.
  3. Emit structured telemetry (JSON) to stdout/file and an optional HTTP metrics endpoint for Prometheus.
  4. Handle signals gracefully so children are shut down cleanly when the machine reboots or the user logs out.
  5. Keep the footprint tiny so it can be used as a desktop stability tool or a lightweight systemd alternative on machines where full systemd config is onerous.

Design constraints & security notes

Important: You cannot prevent SIGKILL or kernel-initiated OOM kills. The supervisor's goal is fast recovery plus observability. To reduce accidental kills, run the supervisor with appropriate permissions and avoid spawning child processes as root unless necessary.

  • To protect against prank tools, run the supervisor as a user-level daemon or use OS-level protections (Linux capabilities, Windows job objects).
  • Do not attempt to circumvent kernel signals; instead rely on fast restart, health checks, and telemetry.
  • Drop privileges for child processes when possible and use namespaces/cgroups to limit blast radius.

Implementation walkthrough — core Rust supervisor

We’ll use async Rust with tokio for timers and serde_json for telemetry. Use the nix crate for Unix signal handling. The core loop launches a child, waits for exit, records telemetry, and restarts with backoff.

Cargo dependencies (Cargo.toml)

[dependencies]
tokio = { version = "1.35", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
nix = { version = "0.27", features = ["signal", "process"] }
thiserror = "1.0"
prometheus = "0.13" # optional for metrics endpoint
tracing = "0.1"
tracing-subscriber = "0.3"

Supervisor core (simplified)

use std::process::Stdio;
use std::time::Duration;
use serde::Serialize;
use tokio::process::Command;
use tokio::time::sleep;
use tracing::{info, error};

#[derive(Serialize)]
struct Telemetry {
    event: &'static str,
    cmd: String,
    attempt: u32,
    exit_code: Option<i32>,
    reason: Option<String>,
}

async fn run_supervisor(cmd: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    let mut attempt: u32 = 0;
    let mut backoff = Duration::from_millis(500);

    loop {
        attempt += 1;
        let mut c = Command::new(&cmd[0]);
        if cmd.len() > 1 {
            c.args(&cmd[1..]);
        }
        c.stdout(Stdio::inherit()).stderr(Stdio::inherit());

        info!(attempt, ?cmd, "starting child");

        let mut child = c.spawn()?;
        let status = child.wait().await;

        let (exit_code, reason) = match status {
            Ok(s) => (s.code(), None),
            Err(e) => (None, Some(e.to_string())),
        };

        let t = Telemetry { event: "process_exit", cmd: cmd.join(" "), attempt, exit_code, reason };
        println!("{}", serde_json::to_string(&t)?);

        // Exponential backoff to avoid tight respawn loops
        sleep(backoff).await;
        backoff = std::cmp::min(backoff * 2, Duration::from_secs(30));
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    let args: Vec<String> = std::env::args().skip(1).collect();
    if args.is_empty() {
        eprintln!("Usage: supervisor <command> [args...]");
        std::process::exit(2);
    }

    if let Err(e) = run_supervisor(args).await {
        error!(%e, "supervisor failed");
        std::process::exit(1);
    }
}

This minimal supervisor handles the core loop. It prints structured JSON telemetry to stdout and uses an exponential backoff to avoid crash loops. Next, we’ll add signal handling and an optional metrics endpoint.
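One refinement worth calling out: the loop above doubles the delay forever, so a crash after a week of healthy uptime would still wait up to 30 seconds. A common fix (my assumption, not from the text above) is to reset the backoff once the child has run long enough:

```rust
use std::time::Duration;

const BASE: Duration = Duration::from_millis(500);
const MAX: Duration = Duration::from_secs(30);

// Compute the next restart delay. If the child stayed up longer than
// `healthy_after`, treat the crash streak as over and start again from
// the base delay; otherwise keep doubling, capped at MAX.
fn next_backoff(current: Duration, uptime: Duration, healthy_after: Duration) -> Duration {
    if uptime >= healthy_after {
        BASE
    } else {
        std::cmp::min(current * 2, MAX)
    }
}
```

Wiring this in means recording an `Instant` before each spawn and passing the measured uptime after the child exits.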

Graceful shutdown and signal handling

Handle SIGINT and SIGTERM to terminate children before exiting. On Linux, use the nix crate and tokio signal support.

use tokio::signal::unix::{signal, SignalKind};

async fn install_signal_handler() {
    let mut sigterm = signal(SignalKind::terminate()).unwrap();
    let mut sigint = signal(SignalKind::interrupt()).unwrap();

    tokio::spawn(async move {
        tokio::select! {
            _ = sigterm.recv() => tracing::info!("received SIGTERM, exiting"),
            _ = sigint.recv() => tracing::info!("received SIGINT, exiting"),
        }
        // send termination to children or set a shutdown flag
        // (Design: supervise child via a shared Mutex/Arc so we can kill it here.)
    });
}

Telemetry & metrics — make failures visible

Logs alone aren’t enough. In 2026, observability is table stakes. Export small, structured telemetry bundles and optionally a Prometheus-compatible metrics endpoint so existing monitoring stacks can alert on crash rates.

Simple telemetry design

  • Write one JSON line per event to stdout (easy to collect with systemd/journald or a log forwarder like Vector).
  • Include fields: timestamp, event, cmd, attempt, exit_code, reason, uptime_since_last_start.
  • Rotate logs externally; keep supervisor log footprint small.
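The supervisor itself uses serde_json for this, but the shape of one event line is easy to see in a stdlib-only sketch. This hand-rolled formatter is an illustration, not the real implementation: it only escapes quotes and backslashes, which is enough for simple command lines.

```rust
// Format one telemetry event as a single JSON line (stdlib only).
// The real supervisor serializes with serde_json; this sketch just
// shows the one-line-per-event shape described above.
fn telemetry_line(event: &str, cmd: &str, attempt: u32, exit_code: Option<i32>) -> String {
    let esc = |s: &str| s.replace('\\', "\\\\").replace('"', "\\\"");
    let code = match exit_code {
        Some(c) => c.to_string(),
        None => "null".to_string(), // e.g. killed by a signal
    };
    format!(
        "{{\"event\":\"{}\",\"cmd\":\"{}\",\"attempt\":{},\"exit_code\":{}}}",
        esc(event), esc(cmd), attempt, code
    )
}
```

One event per line keeps the stream trivially parseable by journald, Vector, or any line-oriented forwarder.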

Optional Prometheus metrics

Use the prometheus crate to expose counters like process_restarts_total, last_exit_code, and a gauge for restart_backoff_seconds, served from a tiny HTTP endpoint.
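To give a sense of what that endpoint serves, here is a stdlib sketch of the Prometheus text exposition format for the metrics named above. In practice the `prometheus` crate renders this for you; the sketch only shows the wire format.

```rust
// Render supervisor metrics in the Prometheus text exposition format.
// In the real supervisor this output would come from the `prometheus`
// crate's registry and be served over a tiny HTTP endpoint.
fn render_metrics(restarts: u64, backoff_secs: f64) -> String {
    format!(
        "# TYPE process_restarts_total counter\n\
         process_restarts_total {}\n\
         # TYPE restart_backoff_seconds gauge\n\
         restart_backoff_seconds {}\n",
        restarts, backoff_secs
    )
}
```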

Advanced strategies (real-world hardening)

Below are strategies commonly used by production desktop and edge supervisors.

1) Health checks & liveness probes

Don’t rely only on process exit. Implement application-level health probes (HTTP, unix socket) that the supervisor can poll. If a process is alive but unhealthy, restart it on failing the probe N times.
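A TCP-level probe is the simplest version of this idea: the service is considered healthy if its port accepts a connection within a timeout. An HTTP or unix-socket probe would follow the same shape with an extra request/response step. This is a sketch, not the full N-strikes restart logic:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// Liveness probe: healthy if the service's TCP port accepts a
// connection within `timeout`. The supervisor would poll this on a
// timer and restart the child after N consecutive failures.
fn tcp_healthy(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```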

2) Crash-loop prevention (circuit breaker)

Stop restarting if the process crashes too often within a short window. Move into a backoff state and alert:

  • On 5 crashes within 2 minutes, mark service unhealthy and require manual intervention or longer cool-down.
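A sliding-window counter is enough to implement this breaker. The sketch below uses the thresholds from the bullet above (5 crashes in 2 minutes) and keeps only timestamps that still fall inside the window:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Circuit breaker: trips when too many crashes land inside a
// sliding time window, e.g. 5 crashes within 2 minutes.
struct CrashWindow {
    crashes: VecDeque<Instant>,
    window: Duration,
    max_crashes: usize,
}

impl CrashWindow {
    fn new(window: Duration, max_crashes: usize) -> Self {
        Self { crashes: VecDeque::new(), window, max_crashes }
    }

    /// Record a crash; returns true if the breaker has tripped and
    /// the supervisor should stop auto-restarting pending manual
    /// intervention or a longer cool-down.
    fn record_crash(&mut self, now: Instant) -> bool {
        self.crashes.push_back(now);
        // Discard crashes that have fallen out of the window.
        while let Some(&front) = self.crashes.front() {
            if now.duration_since(front) > self.window {
                self.crashes.pop_front();
            } else {
                break;
            }
        }
        self.crashes.len() >= self.max_crashes
    }
}
```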

3) Different restart policies

  • always: restart unconditionally (good for resilient services)
  • on-failure: restart only if exit code != 0
  • on-crash: restart only for non-clean exits
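The three policies above reduce to a small decision function over the child's exit status. In this sketch, `exit_code` is `None` when the child was killed by a signal (matching how `ExitStatus::code()` behaves on Unix):

```rust
// The three restart policies from the list above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum RestartPolicy {
    Always,    // restart unconditionally
    OnFailure, // restart when exit code != 0
    OnCrash,   // restart only for non-clean exits (killed by signal)
}

// `exit_code` is None when the child did not exit with a code,
// i.e. it was terminated by a signal.
fn should_restart(policy: RestartPolicy, exit_code: Option<i32>) -> bool {
    match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => exit_code != Some(0),
        RestartPolicy::OnCrash => exit_code.is_none(),
    }
}
```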

4) Desktop-specific protections

On desktop environments, accidental kills often come from user tools. Use these tactics:

  • Launch the supervised app as a child of the supervisor and keep it in a unique process group; this helps targeted signaling when shutting down.
  • For Windows, use Job Objects to ensure children are terminated with the supervisor or to detect process terminations from external force.
  • Set clear UX expectations — show notifications when the supervisor restarts an app so users know why a window disappeared and reappeared.
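The process-group tactic from the first bullet looks like this on Unix: passing `process_group(0)` makes the child the leader of a fresh group with pgid equal to its own pid, so the supervisor can later signal the whole group (negative pid) without signaling itself. Linux/Unix only; Windows would use Job Objects instead.

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Spawn the child as the leader of its own process group so the
// supervisor can signal the group (kill(-pgid, ...)) at shutdown
// without delivering the signal to itself.
fn spawn_in_own_group(program: &str, args: &[&str]) -> std::io::Result<Child> {
    Command::new(program)
        .args(args)
        .process_group(0) // 0 = new group, pgid == the child's pid
        .spawn()
}
```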

5) Observability with eBPF (optional)

In 2026, eBPF can be used to find sources of external process terminations (connect kill events to caller PIDs). If you need to trace who is killing processes, capture audit data or use eBPF tracing agents — but keep the core supervisor simple and portable.

Integrating with existing service managers

Use the supervisor as a complement to systemd, launchd, or Windows Services. Options:

  • Run the supervisor under systemd as a simple unit; systemd manages the supervisor while the supervisor manages the application’s restarts and telemetry.
  • For single-user desktop installs, use an XDG autostart entry or a launch agent that keeps the supervisor alive.

Example systemd unit (user scope)

[Unit]
Description=My App Supervisor (user)

[Service]
ExecStart=/usr/local/bin/my-supervisor /usr/bin/my-critical-app --flag
Restart=on-failure

[Install]
WantedBy=default.target
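For the single-user desktop option mentioned above, an XDG autostart entry can keep the supervisor alive across logins. The filename and paths here are illustrative:

```ini
# ~/.config/autostart/my-supervisor.desktop (filename illustrative)
[Desktop Entry]
Type=Application
Name=My App Supervisor
Exec=/usr/local/bin/my-supervisor /usr/bin/my-critical-app --flag
X-GNOME-Autostart-enabled=true
```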

Real-world checklist before deploying

  1. Decide restart policy and crash-loop thresholds.
  2. Enable structured telemetry and connect to your log pipeline (Vector, Loki, or plain file forwarder).
  3. Expose metrics for monitoring and set alerts on high restart rates.
  4. Run the supervisor as a least-privilege user and use OS primitives to limit child permissions.
  5. Include a manual recovery step in documentation (how to stop, inspect logs, and restart without automatic restarts).

Why not just use systemd?

Systemd is powerful, but it can be heavy for some scenarios (single-user desktops, small edge devices, or when you need app-specific health checks and bespoke telemetry). A tiny Rust supervisor is:

  • Easy to ship with your app binary
  • Portable between Linux distributions and embed within installers
  • Simple to reason about and audit compared to complex unit files

Limitations — be honest

Don't promise to stop SIGKILL. You can’t. If the kernel kills a process (OOM, SIGKILL), the supervisor can only observe the exit and restart. If the supervisor itself is killed, you need an external launcher (systemd, launchd, Windows Service manager) to ensure the supervisor is relaunched on boot.

Observability playbook — when a process keeps getting killed

  1. Check supervisor telemetry to see timestamps and exit codes.
  2. If exit codes are missing, inspect dmesg/journalctl for OOM kills or kernel messages.
  3. Use eBPF or audit logs to identify the PID/user that issued the kill.
  4. If kills are intentional and legitimate, consider adjusting policies (e.g., allow restart-on-failure for certain error ranges).
Trends to watch through 2026

  • Rust continues to dominate small, safe agents — expect more libraries that make supervisor features (health checks, metrics) plug-and-play.
  • eBPF observability will be the standard way to attribute external process terminations.
  • WASM sandboxes may host desktop apps — supervisors will evolve to manage sandboxed workloads and communicate via standardized health endpoints.
  • Edge devices will increasingly rely on small Rust guardians that bridge between local processes and centralized observability planes.

Actionable takeaways

  • Build a supervisor in Rust when you need a small, safe agent to keep critical processes alive on desktops or edge devices.
  • Emit one-line JSON telemetry for easy ingestion and add a Prometheus metrics endpoint if you run monitoring stacks.
  • Implement exponential backoff and a circuit breaker to avoid crash loops.
  • Use signal handling to gracefully stop children and don’t attempt to catch SIGKILL.
  • When persistent unexpected kills occur, combine supervisor telemetry, kernel logs, and eBPF/audit data to find the root cause.
"You can’t prevent every kill, but you can make crashes irrelevant to users by restarting fast and making failures visible."

Next steps — build and iterate

Clone a repo scaffold (or create one) with the minimal supervisor above. Add these features iteratively:

  • Graceful shutdown and child process management via shared state
  • Structured telemetry + log rotation hooks
  • Metrics endpoint for Prometheus and sensible alerts
  • Optional eBPF integration for attribution

Call to action

Try the pattern: compile the example supervisor, run it as a user-level agent for a critical desktop app, and connect logs to your telemetry pipeline. Ship the supervisor with your app, iterate on health checks, and push metrics to your monitoring system. If you want a jumpstart, fork the sample and add platform-specific features (Windows job objects, macOS launch agents). Share your improvements back to the community — small resilient agents are a key part of modern SRE and desktop stability practices in 2026.
