doctor

Container health check protocol and sandbox-level diagnostics.

Defines the shared DoctorCheck / CheckVerdict protocol used across the terok package chain (sandbox → agent → terok). Each package contributes domain-specific checks; the top-level terok sickbay orchestrates execution inside containers via podman exec.

Sandbox-level checks verify host-side service reachability from within a container (vault token broker TCP, SSH signer TCP) and shield firewall state.

CheckVerdict(severity, detail, fixable=False) dataclass

Result of evaluating a single health check probe.

severity instance-attribute

"ok", "warn", or "error".

detail instance-attribute

Human-readable explanation.

fixable = False class-attribute instance-attribute

Whether fix_cmd should be offered to the operator.
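
As a minimal sketch of how an evaluate callable might turn a probe result into a verdict, the stand-in below mirrors the documented CheckVerdict fields. The class body and the evaluate_tcp_probe function are assumptions made for illustration, not the package's actual code.

```python
from dataclasses import dataclass


@dataclass
class CheckVerdict:
    """Stand-in mirroring the documented fields; not the real class."""

    severity: str  # "ok", "warn", or "error"
    detail: str  # human-readable explanation
    fixable: bool = False  # whether fix_cmd should be offered


def evaluate_tcp_probe(returncode: int, stdout: str, stderr: str) -> CheckVerdict:
    """Hypothetical evaluator: exit code 0 is taken to mean the port answered."""
    if returncode == 0:
        return CheckVerdict("ok", "service reachable")
    # Prefer the probe's own diagnostic when it printed one.
    reason = stderr.strip() or "no diagnostic output"
    return CheckVerdict("error", f"service unreachable: {reason}")
```

The evaluator receives the same (returncode, stdout, stderr) triple regardless of how the probe was run, which keeps it independent of the execution mode.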

DoctorCheck(category, label, probe_cmd, evaluate, fix_cmd=None, fix_description='', host_side=False) dataclass

A single health check to run inside (or against) a container.

The probe_cmd is executed via podman exec <cname> ... by the orchestrator. The evaluate callable interprets the result. If fix_cmd is set, the orchestrator may offer it when the check fails with fixable=True.

Dual execution modes:

  • Container mode (host_side=False): the orchestrator runs probe_cmd via podman exec and passes the result to evaluate. The standalone doctor command runs the same probe_cmd directly via subprocess on the host.
  • Host-side mode (host_side=True): the orchestrator bypasses probe_cmd entirely and performs the check via Python APIs (e.g. ShieldManager), then passes resolved state to evaluate. The standalone doctor command calls evaluate(0, "", "") and the function performs the check itself or reports a neutral result.
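
The two modes above can be sketched from the standalone doctor command's point of view. Everything here is a simplified stand-in: Verdict, Check, podman_exec_argv, and run_standalone are hypothetical names, and treating probe_cmd as an argv list is an assumption based on the probe_cmd=[] usage in the source further down.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Stand-in for CheckVerdict (illustrative only)."""

    severity: str
    detail: str


@dataclass
class Check:
    """Stand-in for DoctorCheck with only the fields this sketch needs."""

    label: str
    probe_cmd: List[str]
    evaluate: Callable[[int, str, str], Verdict]
    host_side: bool = False


def podman_exec_argv(container_name: str, probe_cmd: List[str]) -> List[str]:
    """Build the argv an orchestrator could use to run a probe in a container."""
    return ["podman", "exec", container_name, *probe_cmd]


def run_standalone(check: Check) -> Verdict:
    """Run one check the way the standalone doctor command is described."""
    if check.host_side:
        # Host-side mode: no probe is executed; evaluate does the work itself.
        return check.evaluate(0, "", "")
    # Container mode, standalone variant: run probe_cmd directly via subprocess.
    proc = subprocess.run(check.probe_cmd, capture_output=True, text=True, check=False)
    return check.evaluate(proc.returncode, proc.stdout, proc.stderr)


def _by_exit_code(rc: int, stdout: str, _stderr: str) -> Verdict:
    """Hypothetical evaluator: exit code 0 is ok, anything else is an error."""
    return Verdict("ok" if rc == 0 else "error", stdout.strip())


# Hypothetical demo checks; the probe reuses the running Python interpreter.
demo_probe = Check("probe runs", [sys.executable, "-c", "print('hi')"], _by_exit_code)
demo_host = Check(
    "host state",
    [],
    lambda rc, out, err: Verdict("ok", "checked on host"),
    host_side=True,
)
```

Calling run_standalone(demo_probe) spawns a subprocess, while run_standalone(demo_host) never touches probe_cmd. An orchestrator would instead pass podman_exec_argv(cname, check.probe_cmd) to its own runner for container-mode checks.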

category instance-attribute

Grouping key: "bridge", "env", "mount", "network", "shield", "git", or "vault".

label instance-attribute

Human-readable check name shown in output.

probe_cmd instance-attribute

Command to run inside the container via podman exec. Host-side checks may leave it empty (make_recovery_acknowledged_check passes []).

evaluate instance-attribute

(returncode, stdout, stderr) → CheckVerdict.

fix_cmd = None class-attribute instance-attribute

Optional remediation command for podman exec.

fix_description = '' class-attribute instance-attribute

Shown to the operator before applying the fix.

host_side = False class-attribute instance-attribute

If True, the check runs on the host (not via podman exec). The orchestrator calls evaluate(0, "", "") and the evaluate function performs the host-side check itself.
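
The interplay between fixable, fix_cmd, and fix_description can be sketched as a small decision helper. Both functions are hypothetical, and treating every non-"ok" severity (including "warn") as a failure is an assumption; the real orchestrator may be stricter.

```python
from typing import List, Optional


def should_offer_fix(severity: str, fixable: bool, fix_cmd: Optional[List[str]]) -> bool:
    """Offer a fix only for a non-ok, fixable verdict that actually has a fix_cmd."""
    return severity != "ok" and fixable and fix_cmd is not None


def fix_prompt(label: str, fix_description: str) -> str:
    """Format the text shown to the operator before the fix is applied."""
    return f"{label}: {fix_description} Apply fix? [y/N]"
```

Keeping the decision separate from the prompt text means a check with only a fix_description (and no fix_cmd) can still explain a manual remedy without offering to run anything.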

sandbox_doctor_checks(*, token_broker_port=None, ssh_signer_port=None, desired_shield_state=None)

Return sandbox-level health checks for in-container diagnostics.

Parameters:

  • token_broker_port (int | None, default None): Token broker TCP port (skip check if None).
  • ssh_signer_port (int | None, default None): SSH signer TCP port (skip check if None).
  • desired_shield_state (str | None, default None): Expected shield state from shield_desired_state file ("up", "down", "disengaged", or None to skip).

Returns:

  • list[DoctorCheck]: List of DoctorCheck instances ready for orchestration.

Source code in src/terok_sandbox/doctor.py
def sandbox_doctor_checks(
    *,
    token_broker_port: int | None = None,
    ssh_signer_port: int | None = None,
    desired_shield_state: str | None = None,
) -> list[DoctorCheck]:
    """Return sandbox-level health checks for in-container diagnostics.

    Args:
        token_broker_port: Token broker TCP port (skip check if ``None``).
        ssh_signer_port: SSH signer TCP port (skip check if ``None``).
        desired_shield_state: Expected shield state from ``shield_desired_state``
            file (``"up"``, ``"down"``, ``"disengaged"``, or ``None`` to skip).

    Returns:
        List of [`DoctorCheck`][terok_sandbox.doctor.DoctorCheck] instances ready for orchestration.
    """
    checks: list[DoctorCheck] = [
        _make_vault_unlocked_check(),
        _make_plaintext_passphrase_warning_check(),
    ]
    if token_broker_port is not None:
        checks.append(_make_token_broker_check(token_broker_port))
    if ssh_signer_port is not None:
        checks.append(_make_ssh_signer_check(ssh_signer_port))
    checks.append(_make_shield_check(desired_shield_state))
    return checks

make_recovery_acknowledged_check()

Warn when the operator hasn't confirmed they saved the recovery key.

The severity depends on the resolved tier when the acknowledgement marker is absent. The session-file tier does not survive a reboot, so "unconfirmed AND session-only" is a genuine error: the vault is one reboot away from being lost. Every durable tier (keyring, systemd-creds, config) yields only a warn, because those tiers are machine-bound and still need an off-host copy for disaster recovery.

Intentionally NOT bundled into sandbox_doctor_checks: that list is consumed per-container by terok's sickbay, and a host-bound recovery check would render once per task. Top-level callers (the terok-sandbox doctor CLI, terok's host-level sickbay row) invoke this factory directly so the check renders exactly once.
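
The reason for keeping the host-bound check out of the per-container list can be illustrated with a small planning sketch. The plan_rows function and the "host" row name are hypothetical; they only show how a top-level caller might schedule host checks once and container checks once per container.

```python
from typing import Iterable, List, Tuple


def plan_rows(
    container_names: Iterable[str],
    per_container_checks: Iterable[str],
    host_checks: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (target, check label) pairs in the order they would be rendered."""
    per_container = list(per_container_checks)
    # Host-bound checks are scheduled exactly once, independent of container count.
    rows = [("host", label) for label in host_checks]
    # Per-container checks repeat for every container.
    for name in container_names:
        rows.extend((name, label) for label in per_container)
    return rows
```

With two containers, one per-container check, and one host check, this yields three rows. Bundling the host check into the per-container list would instead render it twice.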

Source code in src/terok_sandbox/doctor.py
def make_recovery_acknowledged_check() -> DoctorCheck:
    """Warn when the operator hasn't confirmed they saved the recovery key.

    Two severity bands depending on the resolved tier when the marker
    is absent — the session-file tier dies on the next reboot, so
    "unconfirmed AND session-only" is a genuine ``error`` (you are
    literally one reboot away from losing the vault), while every
    durable tier (keyring, systemd-creds, config) is "only" a ``warn``
    (machine-bound; needs an off-host copy for disaster recovery).

    Intentionally NOT bundled into
    [`sandbox_doctor_checks`][terok_sandbox.doctor.sandbox_doctor_checks]:
    that list is consumed per-container by terok's sickbay, and a
    host-bound recovery check would render once per task.  Top-level
    callers (the ``terok-sandbox doctor`` CLI, terok's host-level
    sickbay row) invoke this factory directly so the check renders
    exactly once.
    """

    def _eval(_rc: int, _stdout: str, _stderr: str) -> CheckVerdict:
        from ._stage import bold  # noqa: PLC0415
        from .vault.store.recovery import RecoveryStatus  # noqa: PLC0415

        status = RecoveryStatus.load()
        if status.acknowledged:
            return CheckVerdict("ok", "recovery key acknowledged")
        reveal = bold("terok-sandbox vault passphrase reveal")
        ack = bold("terok-sandbox vault passphrase acknowledge")
        if status.session_only:
            return CheckVerdict(
                "error",
                "vault recovery key UNCONFIRMED and the passphrase lives ONLY"
                " in the session-unlock tmpfs file — it will be wiped on the"
                " next reboot and your vault becomes UNRECOVERABLE then."
                f" Run {reveal} NOW and save the value off-host,"
                f" or {ack} if you already captured it.",
            )
        return CheckVerdict(
            "warn",
            "vault recovery key unconfirmed — every keystore tier is"
            " machine-bound, so a hardware failure strands the vault."
            f" Run {reveal} to view and save the value off-host,"
            f" or {ack} if you already captured it.",
        )

    return DoctorCheck(
        category="vault",
        label="Recovery key acknowledged",
        probe_cmd=[],
        evaluate=_eval,
        host_side=True,
        fix_description=(
            "Run `terok-sandbox vault passphrase reveal`, copy the value into"
            " an off-host store (password manager / paper safe), and confirm"
            " when prompted; or run `terok-sandbox vault passphrase acknowledge`"
            " after capturing the value via `--echo-passphrase`."
        ),
    )