supervisor

Per-container supervisor — one process per container, lives for its lifetime.

Composes the terok-vault, clearance-hub, and verdict-server into a single in-process service bundle built per container. Spawned by the OCI prestart hook (see terok_sandbox.resources.hooks.supervisor_hook) through the restart-loop wrapper (see terok_sandbox.resources.supervisor_wrapper); exits when podman wait returns.

The entry point is run_supervisor: the CLI command terok-sandbox supervisor <container_id> <sidecar_path> invokes it under asyncio.run with both arguments.

__all__ = ['SidecarConfig', 'SupervisorPaths', 'load_sidecar', 'run_supervisor'] module-attribute

SidecarConfig(container_name, ipc_mode, db_path, runtime_dir, scope_id=None, project_id='', task_id='', tcp_port=None, ssh_signer_port=None, gate_port=None, gate_base_path=None, gate_token=None, dossier_path=None) dataclass

Per-container config the supervisor reads from the sidecar JSON.

Written by terok-sandbox prepare (and equivalents in terok-executor / terok) at container-creation time. Keyed by container name initially; promoted to a container-ID-keyed filename on first hook fire (see terok_sandbox.resources.hooks._supervisor_state.load_sidecar).
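
For orientation, a sidecar file for a TCP-mode container might look like the output of the sketch below. The keys mirror the SidecarConfig fields listed above; every value (paths, ports, IDs, token) is invented for illustration, and the exact file that terok-sandbox prepare writes is not confirmed here.

```python
import json

# Hypothetical sidecar payload: keys follow SidecarConfig's fields,
# values are made up for illustration only.
sidecar = {
    "container_name": "demo-task",
    "ipc_mode": "tcp",  # "socket" or "tcp"
    "db_path": "/home/alice/.local/share/terok/vault.db",
    "runtime_dir": "/run/user/1000/terok/sandbox",
    "scope_id": None,
    "project_id": "demo",
    "task_id": "42",
    "tcp_port": 18080,         # vault proxy, TCP mode only
    "ssh_signer_port": 18081,  # SSH signer, TCP mode only
    "gate_port": 18082,        # git gate, TCP mode only
    "gate_base_path": "/home/alice/.local/share/terok/gate",
    "gate_token": "example-token",
    "dossier_path": None,
}

sidecar_json = json.dumps(sidecar, indent=2)
print(sidecar_json)
```

In socket mode the three port keys would be absent or null, since the socket paths are derived rather than carried in the file.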

container_name instance-attribute

ipc_mode instance-attribute

db_path instance-attribute

runtime_dir instance-attribute

/run/user/<host_uid>/terok/sandbox — pinned by the launch path because the supervisor cannot re-derive it from inside crun's rootless user namespace (its os.getuid() is 0 there, which misroutes generic resolvers to the root-only /run/terok).

scope_id = None class-attribute instance-attribute

project_id = '' class-attribute instance-attribute

task_id = '' class-attribute instance-attribute

tcp_port = None class-attribute instance-attribute

Per-container TCP port for the vault proxy in TCP mode. None in socket mode (the path is derived from the container ID, not carried here).

ssh_signer_port = None class-attribute instance-attribute

Per-container TCP port for the SSH signer in TCP mode. None in socket mode.

gate_port = None class-attribute instance-attribute

Per-container TCP port for the git gate in TCP mode. None in socket mode.

gate_base_path = None class-attribute instance-attribute

Directory holding the shared per-project bare mirrors (<gate_base_path>/<project_id>.git). None when the gate is not wired for this container.

gate_token = None class-attribute instance-attribute

The single token the gate validates. Travels only via the sidecar; None when the gate is not wired.

dossier_path = None class-attribute instance-attribute

SupervisorPaths(container_id, container_runtime_dir, vault_socket, ssh_signer_socket, gate_socket, clearance_socket, events_socket, verdict_socket, control_socket, log_path) dataclass

Resolved per-container socket and log locations.

Computed once at supervisor startup from the container ID, the container name, and two anchors supplied by the launch path: the sidecar's runtime_dir and the location of the sidecar file itself.

container_id instance-attribute

container_runtime_dir instance-attribute

Per-container directory holding vault.sock and ssh-agent.sock. Keyed on container_name (which the launch path knows before podman run so it can pre-create the dir and bind-mount it as /run/terok/ inside the container). Different containers get different host dirs; the in-container view of these sockets is always /run/terok/.

vault_socket instance-attribute

ssh_signer_socket instance-attribute

gate_socket instance-attribute

Per-container git-gate Unix socket inside container_runtime_dir (= the in-container /run/terok). Used only in socket mode; in TCP mode the gate binds a loopback port instead.

clearance_socket instance-attribute

events_socket instance-attribute

Per-container ingester socket the shield reader and shield up/down push raw line-JSON to. Distinct from clearance_socket (the varlink subscriber socket operator UIs glob): the reader speaks line-JSON, not varlink, so the produce and subscribe roles need separate sockets.
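
To make "line-JSON" concrete: each event is one JSON object terminated by a newline, written to a stream socket. The sketch below demonstrates the framing over a local socket pair instead of the real events_socket; the event fields ("type", "container") are hypothetical and not taken from the shield reader.

```python
import json
import socket


def encode_event(event: dict) -> bytes:
    """Serialize one event as a single newline-terminated JSON line."""
    return json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"


def decode_lines(buffer: bytes) -> list[dict]:
    """Split a received byte stream into one parsed object per line."""
    return [json.loads(line) for line in buffer.splitlines() if line.strip()]


# A connected pair of stream sockets stands in for producer and ingester.
producer, ingester = socket.socketpair()
producer.sendall(encode_event({"type": "shield_up", "container": "0123456789ab"}))
producer.sendall(encode_event({"type": "shield_down", "container": "0123456789ab"}))
producer.close()  # EOF tells the reader that no more lines follow

chunks = []
while chunk := ingester.recv(4096):
    chunks.append(chunk)
ingester.close()

events = decode_lines(b"".join(chunks))
print(events)
```

A varlink subscriber expects a different wire format, which is why the produce and subscribe roles cannot share one socket.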

verdict_socket instance-attribute

control_socket instance-attribute

log_path instance-attribute

for_container(container_id, container_name, sidecar_path, runtime_dir) classmethod

Build the per-container path bundle.

Both anchors come from the launch path — neither is re-resolved inside the supervisor:

  • runtime_dir (/run/user/<host_uid>/terok/sandbox) for per-container sockets; carried in the sidecar because the rootless user namespace makes generic is_root-based resolvers (terok_util.namespace_runtime_dir) misroute to /run/terok.
  • sidecar_path's grandparent for the persistent log file; honours whatever paths.root resolved to when the launch path wrote the sidecar.

Sockets carry the 12-char short container ID (podman's display convention) rather than the full UUID — AF_UNIX's sun_path is 108 bytes including the null terminator, and <terok-runtime>/clearance/<64-char-uuid>.sock lands at or past that limit. Twelve characters of hex give 48 bits of entropy, well past the no-collisions-within-one-host bar. Logs keep the full UUID because they live on the filesystem with no AF_UNIX limit and the full UUID is easier to grep.
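
The length argument can be checked with a few lines of arithmetic. The sketch below assumes Linux's 108-byte sun_path and a hypothetical runtime root of /run/user/1000/terok; a longer root (a longer UID, a custom runtime directory) eats into the headroom shown.

```python
from pathlib import PurePosixPath

SUN_PATH_MAX = 108         # size of sun_path on Linux, including the NUL
USABLE = SUN_PATH_MAX - 1  # bytes available for the path itself
ID_BITS = 12 * 4           # 12 hex characters, 4 bits each


def headroom(path: PurePosixPath) -> int:
    """Bytes left in sun_path after *path*; negative means it does not fit."""
    return USABLE - len(str(path).encode("utf-8"))


# Hypothetical runtime root; real deployments may use a longer one.
root = PurePosixPath("/run/user/1000/terok")
full_id = "a" * 64  # stand-in for a full 64-hex container ID
short_id = full_id[:12]

full = root / "clearance" / f"{full_id}.sock"
short = root / "clearance" / f"{short_id}.sock"

print(f"full ID:  {len(str(full))} bytes, headroom {headroom(full)}")
print(f"short ID: {len(str(short))} bytes, headroom {headroom(short)}")
print(f"short ID entropy: {ID_BITS} bits")
```

With this root the full-ID path leaves only 7 bytes to spare, so a root just 8 characters longer would no longer bind; the short-ID path leaves 59.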

Clearance / events / verdict / control sockets live at the cross-package <terok>/ runtime root (parent of the sandbox-namespaced runtime_dir) because they're owned by terok-clearance semantically and consumed by every package that subscribes (terok-shield's NFLOG reader, terok-clearance TUI, …). Sandbox-specific sockets (vault, ssh-agent, gate) live in a per-container runtime_dir/run/<container_name>/ directory that the launch path bind-mounts at /run/terok/ inside the container, which keeps every container's sockets distinct on the host so concurrent containers don't collide.

Source code in src/terok_sandbox/supervisor/main.py
@classmethod
def for_container(
    cls,
    container_id: str,
    container_name: str,
    sidecar_path: Path,
    runtime_dir: Path,
) -> SupervisorPaths:
    """Build the per-container path bundle.

    Both anchors come from the launch path — neither is re-resolved
    inside the supervisor:

    * *runtime_dir* (``/run/user/<host_uid>/terok/sandbox``) for
      per-container sockets; carried in the sidecar because the
      rootless user namespace makes generic ``is_root``-based
      resolvers (``terok_util.namespace_runtime_dir``) misroute to
      ``/run/terok``.
    * *sidecar_path*'s grandparent for the persistent log file;
      honours whatever ``paths.root`` resolved to when the launch
      path wrote the sidecar.

    Sockets carry the **12-char short container ID** (podman's
    display convention) rather than the full UUID — AF_UNIX's
    ``sun_path`` is 108 bytes including the null terminator, and
    ``<terok-runtime>/clearance/<64-char-uuid>.sock`` lands at or
    past that limit.  Twelve characters of hex give 48 bits of
    entropy, well past the no-collisions-within-one-host bar.
    Logs keep the full UUID because they live on the filesystem
    with no AF_UNIX limit and the full UUID is easier to grep.

    Clearance / events / verdict / control sockets live at the
    cross-package ``<terok>/`` runtime root (parent of the
    sandbox-namespaced *runtime_dir*) because they're owned by
    terok-clearance semantically and consumed by every package
    that subscribes (terok-shield's NFLOG reader, terok-clearance
    TUI, …).  Sandbox-specific sockets (vault, ssh-agent, gate)
    live in a per-container ``runtime_dir/run/<container_name>/``
    directory that the launch path bind-mounts at ``/run/terok/``
    inside the container, which keeps every container's sockets
    distinct on the host so concurrent containers don't collide.
    """
    short_id = container_id[:12]
    clearance_root = runtime_dir.parent  # <terok>/sandbox/  →  <terok>/
    state_anchor = sidecar_path.parent.parent  # <root>/sidecar/<name>.json → <root>
    # vault + ssh-agent are keyed on container_name (known at
    # launch time, before podman assigns the ID) so the launch
    # path can pre-create the dir and bind-mount it.  Clearance /
    # verdict / control use the 12-char container ID short prefix
    # because they're cross-package (shield's NFLOG reader keys
    # on it too).
    container_runtime_dir = runtime_dir / "run" / container_name
    return cls(
        container_id=container_id,
        container_runtime_dir=container_runtime_dir,
        vault_socket=container_runtime_dir / "vault.sock",
        ssh_signer_socket=container_runtime_dir / "ssh-agent.sock",
        gate_socket=container_runtime_dir / "gate-server.sock",
        clearance_socket=clearance_root / "clearance" / f"{short_id}.sock",
        events_socket=clearance_root / "events" / f"{short_id}.sock",
        verdict_socket=clearance_root / "verdict" / f"{short_id}.sock",
        control_socket=clearance_root / "control" / f"{short_id}.sock",
        log_path=state_anchor / "logs" / f"{container_id}.log",
    )
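
As a worked example of the resulting layout, the sketch below repeats the same path arithmetic on made-up inputs, without importing terok_sandbox. The container ID, name, and both anchor paths are hypothetical.

```python
from pathlib import PurePosixPath

# All inputs are invented for illustration.
container_id = "0123456789abcdef" * 4  # 64 hex characters
container_name = "demo-task"
runtime_dir = PurePosixPath("/run/user/1000/terok/sandbox")
sidecar_path = PurePosixPath("/home/alice/.local/state/terok/sidecar/demo-task.json")

short_id = container_id[:12]
clearance_root = runtime_dir.parent        # .../terok/sandbox -> .../terok
state_anchor = sidecar_path.parent.parent  # <root>/sidecar/<name>.json -> <root>

layout = {
    # Keyed on the container name, known before podman assigns an ID.
    "vault_socket": runtime_dir / "run" / container_name / "vault.sock",
    # Keyed on the 12-character short ID, shared across packages.
    "clearance_socket": clearance_root / "clearance" / f"{short_id}.sock",
    # Keyed on the full ID; no AF_UNIX length limit applies to log files.
    "log_path": state_anchor / "logs" / f"{container_id}.log",
}

for name, path in layout.items():
    print(f"{name:17} {path}")
```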

load_sidecar(sidecar_path)

Read and parse the sidecar JSON at sidecar_path.

The OCI hook pinned this exact path via the terok.sandbox.sidecar annotation, so the supervisor never guesses — it opens the named file directly. Returns None on any I/O / schema failure; run_supervisor surfaces that as exit-code 2.

Source code in src/terok_sandbox/supervisor/main.py
def load_sidecar(sidecar_path: Path) -> SidecarConfig | None:
    """Read and parse the sidecar JSON at *sidecar_path*.

    The OCI hook pinned this exact path via the
    ``terok.sandbox.sidecar`` annotation, so the supervisor never
    guesses — it opens the named file directly.  Returns ``None`` on
    any I/O / schema failure; ``run_supervisor`` surfaces that as
    exit-code 2.
    """
    try:
        with sidecar_path.open(encoding="utf-8") as fh:
            raw = json.load(fh)
    except (OSError, ValueError):
        _logger.exception("sidecar parse failure for %s", sidecar_path)
        return None
    if not isinstance(raw, dict):
        _logger.error("sidecar is not a JSON object: %s", sidecar_path)
        return None
    try:
        container_name = str(raw.get("container_name", "")).strip()
        if not container_name:
            _logger.error("sidecar missing required container_name: %s", sidecar_path)
            return None
        # ``container_name`` is interpolated into ``runtime_dir/run/<name>``
        # and that directory is mkdir'd, chmod'd, and rmtree'd by the
        # supervisor.  Reject a name that is absolute or carries a path
        # separator / parent-dir reference so a malformed sidecar can't
        # redirect those filesystem operations outside the runtime dir.
        if "/" in container_name or container_name in (".", ".."):
            _logger.error(
                "sidecar container_name is not a safe path component, got %r: %s",
                container_name,
                sidecar_path,
            )
            return None
        ipc_mode = str(raw.get("ipc_mode", "socket"))
        if ipc_mode not in ("socket", "tcp"):
            _logger.error(
                "sidecar ipc_mode must be 'socket' or 'tcp', got %r: %s",
                ipc_mode,
                sidecar_path,
            )
            return None
        # Refuse relative paths — the supervisor takes ``rmtree`` over
        # ``runtime_dir`` and binds sockets under it; a malformed sidecar
        # with a relative path would resolve against whatever cwd the
        # OCI hook spawned us with (typically ``/``), which is nowhere
        # we want to touch.
        db_path = _require_absolute_path(raw, "db_path", sidecar_path)
        runtime_dir = _require_absolute_path(raw, "runtime_dir", sidecar_path)
        if db_path is None or runtime_dir is None:
            return None
        dossier_raw = raw.get("dossier_path")
        if dossier_raw:
            dossier_path = _require_absolute_path(raw, "dossier_path", sidecar_path)
            if dossier_path is None:
                return None
        else:
            dossier_path = None
        # ``gate_base_path`` becomes ``git http-backend``'s
        # ``GIT_PROJECT_ROOT`` — refuse a relative path for the same
        # reason as ``db_path`` / ``runtime_dir``.
        if raw.get("gate_base_path"):
            gate_base_path = _require_absolute_path(raw, "gate_base_path", sidecar_path)
            if gate_base_path is None:
                return None
        else:
            gate_base_path = None
        return SidecarConfig(
            container_name=container_name,
            ipc_mode=ipc_mode,
            db_path=db_path,
            runtime_dir=runtime_dir,
            scope_id=raw.get("scope_id") or None,
            project_id=str(raw.get("project_id") or ""),
            task_id=str(raw.get("task_id") or ""),
            tcp_port=(int(raw["tcp_port"]) if raw.get("tcp_port") is not None else None),
            ssh_signer_port=(
                int(raw["ssh_signer_port"]) if raw.get("ssh_signer_port") is not None else None
            ),
            gate_port=(int(raw["gate_port"]) if raw.get("gate_port") is not None else None),
            gate_base_path=gate_base_path,
            gate_token=(str(raw["gate_token"]) if raw.get("gate_token") else None),
            dossier_path=dossier_path,
        )
    except (KeyError, TypeError, ValueError):
        _logger.exception("sidecar schema error in %s", sidecar_path)
        return None

run_supervisor(container_id, sidecar_path) async

Compose + run the per-container service bundle.

Lifecycle:

  1. Load the sidecar JSON from sidecar_path; bail with exit code 2 if the file is missing, unreadable, or fails validation.
  2. Bring the _Services bundle up in dependency order; a startup failure unwinds anything already started and returns exit code 3.
  3. Install SIGTERM / SIGINT handlers that race with podman wait so a host-side terok-sandbox supervisor invocation can be stopped cleanly with Ctrl-C.
  4. Await podman wait <container_id>. When it returns, tear the bundle down in reverse and return 0.

The function is the sole supervisor entry point; the CLI verb terok-sandbox supervisor invokes it via asyncio.run.
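
How the return value becomes a process exit status can be sketched as follows. The exit-code meanings come from the lifecycle above; run_supervisor_stub, main, and the usage-error code 64 are hypothetical stand-ins, not the project's CLI code.

```python
import asyncio
import sys
from pathlib import Path

# Exit codes documented for run_supervisor.
EXIT_CODES = {
    0: "container exited or a stop signal arrived; services torn down",
    2: "sidecar missing or unusable",
    3: "service startup failed",
}


async def run_supervisor_stub(container_id: str, sidecar_path: Path) -> int:
    """Stand-in for run_supervisor: only imitates the sidecar check."""
    return 0 if sidecar_path.is_file() else 2


def main(argv: list[str]) -> int:
    """Hypothetical CLI shim: parse two positionals, run the coroutine."""
    if len(argv) != 2:
        print("usage: terok-sandbox supervisor <container_id> <sidecar_path>", file=sys.stderr)
        return 64  # arbitrary usage-error code, not taken from the source
    container_id, sidecar = argv
    return asyncio.run(run_supervisor_stub(container_id, Path(sidecar)))


code = main(["0123456789ab", "/nonexistent/sidecar.json"])
print(code, EXIT_CODES.get(code, "unknown"))
```

A real entry point would pass the result to sys.exit so that the restart-loop wrapper can read it.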

Source code in src/terok_sandbox/supervisor/main.py
async def run_supervisor(container_id: str, sidecar_path: Path) -> int:
    """Compose + run the per-container service bundle.

    Lifecycle:

    1. Load the sidecar JSON from *sidecar_path*; bail with exit code
       2 if the file is missing, unreadable, or fails validation.
    2. Bring the `_Services`
       bundle up in dependency order; a startup failure unwinds anything
       already started and returns exit code 3.
    3. Install SIGTERM / SIGINT handlers that race with ``podman wait``
       so a host-side ``terok-sandbox supervisor`` invocation can be
       stopped cleanly with Ctrl-C.
    4. Await ``podman wait <container_id>``.  When it returns, tear the
       bundle down in reverse and return 0.

    The function is the sole supervisor entry point; the CLI verb
    ``terok-sandbox supervisor`` invokes it via ``asyncio.run``.
    """
    cfg = load_sidecar(sidecar_path)
    if cfg is None:
        _logger.error(
            "no usable sidecar at %s — aborting supervisor for %s",
            sidecar_path,
            container_id,
        )
        return 2

    paths = SupervisorPaths.for_container(
        container_id, cfg.container_name, sidecar_path, cfg.runtime_dir
    )
    for sock in (
        paths.clearance_socket,
        paths.events_socket,
        paths.verdict_socket,
        paths.control_socket,
        paths.vault_socket,
        paths.ssh_signer_socket,
        paths.gate_socket,
    ):
        sock.parent.mkdir(parents=True, exist_ok=True)
        # ``bind_hardened`` refuses group/world-accessible parents;
        # explicit chmod overrides crun's permissive rootless umask.
        sock.parent.chmod(0o700)

    services = _Services()
    stop_event = asyncio.Event()
    _install_signal_handlers(stop_event)

    try:
        try:
            await services.start(cfg, paths)
        except Exception:
            _logger.exception("supervisor failed to start services for %s", container_id)
            await services.stop()
            return 3

        wait_task = asyncio.create_task(_wait_for_container(container_id))
        stop_task = asyncio.create_task(stop_event.wait())
        done, pending = await asyncio.wait(
            {wait_task, stop_task}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        # Await every task — including the cancelled ones — so the
        # ``podman wait`` subprocess in ``_wait_for_container`` gets a
        # chance to terminate cleanly via its CancelledError handler.
        # Skipping this would orphan the subprocess on stop-signal paths.
        for task in (*done, *pending):
            try:
                await task
            except asyncio.CancelledError:
                pass
            except Exception:
                _logger.exception("supervisor wait task raised for %s", container_id)
        await services.stop()
        return 0
    finally:
        # rmtree the per-container dir on every exit path — startup
        # failure included — so a half-bound socket directory can't
        # outlive the supervisor and confuse the next launch.
        shutil.rmtree(paths.container_runtime_dir, ignore_errors=True)
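
The helpers _install_signal_handlers and _wait_for_container are referenced above but not shown. The sketch below is one plausible shape for them on a Unix event loop, not the project's implementation: install_signal_handlers and wait_for_command are hypothetical names, a sleeping Python child stands in for podman wait, and a timer stands in for a real stop signal.

```python
import asyncio
import signal
import sys


def install_signal_handlers(stop_event: asyncio.Event) -> None:
    """Set *stop_event* on SIGTERM or SIGINT (Unix loops, main thread only)."""
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop_event.set)


async def wait_for_command(*argv: str) -> int:
    """Run *argv* and wait for it; terminate the child if we are cancelled."""
    proc = await asyncio.create_subprocess_exec(*argv)
    try:
        return await proc.wait()
    except asyncio.CancelledError:
        if proc.returncode is None:
            proc.terminate()
            await proc.wait()  # reap the child so it is not orphaned
        raise


async def demo() -> str:
    stop_event = asyncio.Event()
    install_signal_handlers(stop_event)
    # Stand-in for "podman wait <container_id>": a child that just sleeps.
    wait_task = asyncio.create_task(
        wait_for_command(sys.executable, "-c", "import time; time.sleep(10)")
    )
    stop_task = asyncio.create_task(stop_event.wait())
    # Simulate a stop signal arriving half a second after startup.
    asyncio.get_running_loop().call_later(0.5, stop_event.set)

    done, pending = await asyncio.wait(
        {wait_task, stop_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    for task in pending:
        try:
            await task  # lets the CancelledError handler terminate the child
        except asyncio.CancelledError:
            pass
    return "stopped" if stop_task in done else "exited"


result = asyncio.run(demo())
print(result)
```

Awaiting the cancelled task is the step that matters: cancel() only requests cancellation, and the child process is terminated when the task actually runs its CancelledError handler.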