Operations & Access

How to check the system's health, get access, and recover from common failures. Adapted from the operational runbook — verify specifics against current infrastructure.

Quick health checks

The fastest ways to tell if the platform is healthy:

Getting SSH access

Developer access is granted by adding the developer's public key to each server. In brief:

  1. Generate a key pair (ssh-keygen on Linux/macOS; PuTTYgen on Windows).
  2. From the Linode Cloud terminal, sign into each server and append the public key to ~/.ssh/authorized_keys.
  3. Repeat for every server the developer needs. Access is granted per server and covers SSH, SFTP, and Mongo administration.
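
Steps 1 and 2 can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the team's actual procedure: add_authorized_key is a hypothetical helper name, the ed25519 key type and the example comment string are assumptions, and the function is meant to be run on the server after signing in.

```shell
# On the developer's machine (Linux/macOS), generate a key pair.
# Key type and comment are assumptions; adjust to local policy.
#   ssh-keygen -t ed25519 -C "dev@example.com"

# On the server: append a public key to authorized_keys only if it is missing.
# add_authorized_key is a hypothetical helper, not part of the original runbook.
add_authorized_key() {
  local pubkey="$1"                              # the full public key line
  local file="${2:-$HOME/.ssh/authorized_keys}"  # target file (defaults to current user)
  local dir
  dir="$(dirname "$file")"

  # Create the directory with restrictive permissions if it does not exist yet.
  mkdir -p -m 700 "$dir"
  touch "$file"
  chmod 600 "$file"

  # -q quiet, -x whole-line match, -F fixed string (no regex interpretation)
  if grep -qxF -- "$pubkey" "$file"; then
    echo "key already present"
  else
    printf '%s\n' "$pubkey" >> "$file"
    echo "key added"
  fi
}
```

Checking for an existing entry before appending keeps the file free of duplicates when the same key is added more than once.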

Linode dashboard
https://cloud.linode.com/ — the console for all servers (API nodes, Mongo replica set, web). Individual servers expose CPU/network graphs used for the checks below.

Recovery runbook

API instance unresponsive

If performance degrades (often most visible in the admin panel), an API instance may have crashed without recovering. The production API nodes are api-1, api-2, and api-4 (PM2, behind a NodeBalancer). SSH into each and use PM2:

pm2 monit          # inspect per-instance CPU/memory/status
pm2 restart API    # restart the cluster on that node

Restart the nodes one at a time, a few minutes apart, so the NodeBalancer always has healthy nodes to route to. After a restart you may briefly see Mongo connection errors in the console; these should stop shortly, and a steady stream of green lines indicates recovery. Re-check the API URLs above afterward.
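
The staggered restart described above could be scripted roughly like this. It is a sketch only: rolling_restart, DRY_RUN, and PAUSE_SECONDS are hypothetical names, the 180-second pause is an assumed value for "a few minutes", and it assumes the node names resolve as SSH hosts and that pm2 is on the PATH for non-interactive SSH sessions.

```shell
# Restart the PM2-managed API nodes one at a time, pausing between nodes.
# rolling_restart is a hypothetical helper; node names come from the runbook.
rolling_restart() {
  local pause_seconds="${PAUSE_SECONDS:-180}"  # assumed pause between nodes
  local first=1
  local node

  # api-3 is deliberately absent: it is managed by Coolify, not PM2.
  for node in api-1 api-2 api-4; do
    # Wait before every node except the first (skipped in dry-run mode).
    if [ "$first" -eq 0 ] && [ "${DRY_RUN:-0}" != "1" ]; then
      sleep "$pause_seconds"
    fi
    first=0

    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print what would be run without touching any server.
      echo "ssh $node pm2 restart API"
    else
      # Stop at the first failure rather than restarting the remaining nodes.
      ssh "$node" 'pm2 restart API' || return 1
    fi
  done
}
```

Running it first as `DRY_RUN=1 rolling_restart` prints the commands without executing them, which is a cheap way to confirm the node list before a real restart.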

api-3 is different
api-3 is managed by Coolify, not PM2, and runs the API worker — don't expect pm2 to control it. Manage it through Coolify instead. See Environments & Deployment.

Database overload

Check the Mongo servers' CPU graphs in the Linode dashboard (mongo-1 / mongo-2 / mongo-3). Healthy CPU is roughly 15% or below; sustained higher load may warrant a restart. mongo-2 has historically run hot and often acts as primary.
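
To make "sustained higher load" concrete, the rule of thumb can be sketched as a small filter over CPU samples. The runbook reads these values off the Linode graphs; this sketch assumes you have collected one CPU percentage per line by some other means. cpu_sustained_high is a hypothetical helper, and the default of five consecutive samples is an assumption, not a figure from the runbook.

```shell
# Read one CPU percentage per line on stdin and report whether the load
# stayed above the threshold for a run of consecutive samples.
# cpu_sustained_high is a hypothetical helper name.
cpu_sustained_high() {
  local threshold="${1:-15}"  # percent; the runbook's "roughly 15%" guideline
  local min_run="${2:-5}"     # assumed: consecutive samples that count as sustained

  awk -v t="$threshold" -v n="$min_run" '
    $1 + 0 > t { run++; if (run >= n) hit = 1; next }  # extend the current high run
    { run = 0 }                                        # any sample at or below t resets it
    END { print (hit ? "sustained-high" : "ok") }
  '
}
```

A single spike resets on the next normal sample, so only an unbroken run above the threshold is reported as sustained.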

A Mongo node has crashed

If a node's CPU graph flatlines or truncates (a crash; Linode emails on a hard crash, not on a hang):

Full database restoration
The prior runbook's "entire site unresponsive / database restoration" procedure was marked by the author as needing a rewrite ("do not follow these instructions"). Treat full-restore steps as unverified and confirm current backup/restore procedure before acting. Tracked in Gaps & Open Questions.

Deployment notes