Operations & Access

How to check the system's health, get access, and recover from common failures. Adapted from the operational runbook — verify specifics against current infrastructure.

Quick health checks

The fastest ways to tell if the platform is healthy:

- Status dashboard: https://status.macaronikid.com/ reports the three layers (front end, API, admin) at a glance.
- Public pages: load national.macaronikid.com (home) and national.macaronikid.com/articles (slightly more DB-dependent); slow loads hint at database strain.
- Direct API calls: these should return structured data almost immediately (reload a few times, since the API is load-balanced and each request may hit a different node):
  - api.macaronikid.com/api/v1/town/data/national
  - api.macaronikid.com/api/v1/towns/locations
- Sentry: application errors and performance regressions surface here.
- Linode & CloudFlare dashboards: Linode for per-server CPU/network graphs; CloudFlare for traffic/DDoS signals.
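The URL checks above can be scripted. The following is a minimal sketch, not part of the original runbook. It assumes curl is installed, that the hosts are served over https, and that 2 seconds is a reasonable cutoff for "slow" (an illustrative threshold, not a documented one). The function names classify, probe, and check_all are hypothetical.

```shell
# Health URLs from the list above. The https scheme is an assumption.
URLS="https://national.macaronikid.com/
https://national.macaronikid.com/articles
https://api.macaronikid.com/api/v1/town/data/national
https://api.macaronikid.com/api/v1/towns/locations"

# classify STATUS SECONDS [THRESHOLD]: label one probe result.
# THRESHOLD defaults to 2.0 seconds, an illustrative value only.
classify() {
  local status="$1" seconds="$2" threshold="${3:-2.0}"
  if [ "$status" != "200" ]; then
    echo "FAIL"
  elif awk -v t="$seconds" -v max="$threshold" 'BEGIN { exit !(t > max) }'; then
    echo "SLOW"
  else
    echo "OK"
  fi
}

# probe URL: fetch once and print verdict, HTTP status, and total time.
probe() {
  local url="$1" result status seconds
  # -o /dev/null discards the body; -w prints "<http_code> <time_total>";
  # --max-time stops a hung request after 10 seconds.
  result=$(curl -s -o /dev/null --max-time 10 -w '%{http_code} %{time_total}' "$url")
  status=${result%% *}
  seconds=${result##* }
  echo "$(classify "$status" "$seconds") $status ${seconds}s $url"
}

# check_all: probe every URL three times so that different
# load-balanced nodes are likely to answer.
check_all() {
  local url i
  for url in $URLS; do
    for i in 1 2 3; do
      probe "$url"
    done
  done
}
```

After sourcing the file, run check_all and look for FAIL or SLOW lines. Repeated SLOW results on the articles page but not on the home page would be consistent with the database strain described above.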

Monitoring & uptime

Layered, from quickest glance to deepest detail:

| Tool | What it tells you |
| --- | --- |
| `status.macaronikid.com` | Overall ecosystem health, broken into three parts: front end, API, admin. Quickest at-a-glance check; no deep detail. |
| Health URLs (above) | Real user-facing response: home page speed, the articles page (more DB-bound), and direct API endpoints (reload to hit different load-balanced nodes). |
| Sentry | Live application error stream and performance; the first place a code-level regression shows up. |
| Linode dashboard | Per-server CPU/network graphs for every node. Flatlined or truncated CPU = a crashed/stopped instance (Linode emails on a hard crash, not on a hang). |
| CloudFlare dashboard | Edge traffic and DDoS signals for the proxied production town domains. |

Healthy baselines

On the database nodes, CPU around 15% or lower is healthy; sustained higher load (historically on mongo-2, which often runs as primary) may warrant a restart. On the PM2-managed API nodes, use pm2 monit to watch per-instance CPU and memory: an instance showing no CPU activity or runaway memory growth is a sign to run pm2 restart API.

Getting SSH access

Developer access is granted by adding the developer's public key to each server. In brief:

- Generate a key pair (ssh-keygen on Linux/macOS; PuTTYgen on Windows).
- From the Linode Cloud terminal, sign into each server and append the public key to ~/.ssh/authorized_keys.

Access is needed per-server for SSH, SFTP, and Mongo administration.
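The key steps above can be sketched as follows. This is an illustration under assumptions: Ed25519 is one common key type (the runbook does not name one), the email comment and file name are placeholders, and add_authorized_key is a hypothetical helper rather than an existing tool.

```shell
# On the developer's machine: generate a key pair (placeholder comment and path).
#   ssh-keygen -t ed25519 -C "dev@example.com" -f ~/.ssh/id_ed25519

# On each server: append a public key to authorized_keys,
# skipping the write if that exact line is already present.
# add_authorized_key PUBKEY_LINE [AUTH_FILE]
add_authorized_key() {
  local pubkey="$1" auth_file="${2:-$HOME/.ssh/authorized_keys}"
  local ssh_dir
  ssh_dir=$(dirname "$auth_file")
  # sshd ignores keys when the directory or file is too permissive.
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$auth_file"
  chmod 600 "$auth_file"
  # -q quiet, -x whole-line match, -F fixed string (no regex)
  if ! grep -qxF -- "$pubkey" "$auth_file"; then
    printf '%s\n' "$pubkey" >> "$auth_file"
  fi
}
```

On a server this would be called as add_authorized_key "ssh-ed25519 AAAA... dev@example.com", pasting the full contents of the developer's .pub file as the first argument. Running it twice with the same key leaves a single entry.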

Linode dashboard

https://cloud.linode.com/ — the console for all servers (API nodes, Mongo replica set, web). Individual servers expose CPU/network graphs used for the checks below.

Recovery runbook

API instance unresponsive

If performance degrades (often most visible in the admin panel), an API instance may have crashed without recovering. The production API nodes are api-1, api-2, and api-4 (PM2, behind a NodeBalancer). SSH into each and use PM2:

pm2 monit          # inspect per-instance CPU/memory/status
pm2 restart API    # restart the cluster on that node

Space restarts across nodes by a few minutes so the NodeBalancer can keep serving. After a restart you may briefly see Mongo connection errors in the console — these should stop shortly; a steady stream of green lines means recovery. Re-check the API URLs above afterward.
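The staggered restart described above could be scripted roughly like this. It is a sketch under assumptions: the node names are taken to resolve as SSH hosts (for example via ~/.ssh/config), 180 seconds stands in for "a few minutes", and pm2 is assumed to be on the PATH of a non-interactive SSH session. The names rolling_restart, PAUSE_SECONDS, and SSH_CMD are hypothetical.

```shell
# PM2-managed production API nodes (api-3 is excluded: it is managed by Coolify).
API_NODES="api-1 api-2 api-4"
# Pause between nodes; 180 s is a guess at "a few minutes".
PAUSE_SECONDS="${PAUSE_SECONDS:-180}"
# Overridable so the loop can be dry-run, e.g. SSH_CMD=echo.
SSH_CMD="${SSH_CMD:-ssh}"

rolling_restart() {
  local node first=1
  for node in $API_NODES; do
    # Wait between nodes (not before the first) so the NodeBalancer
    # always has other nodes available to serve traffic.
    if [ "$first" -eq 0 ]; then
      sleep "$PAUSE_SECONDS"
    fi
    first=0
    echo "Restarting API on $node"
    # Stop at the first failure instead of restarting the remaining nodes.
    $SSH_CMD "$node" "pm2 restart API" || {
      echo "restart failed on $node, stopping" >&2
      return 1
    }
  done
}
```

Setting SSH_CMD=echo and PAUSE_SECONDS=0 before calling rolling_restart prints the commands without connecting anywhere, which is a cheap way to confirm the node list before a real run.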

api-3 is different

api-3 is managed by Coolify, not PM2, and runs the API worker — don't expect pm2 to control it. Manage it through Coolify instead. See Environments & Deployment.

Database overload

Check the Mongo servers' CPU graphs in the Linode dashboard (mongo-1 / mongo-2 / mongo-3). Healthy CPU is roughly 15% or below; sustained higher load may warrant a restart. mongo-2 has historically run hot and often acts as primary.

A Mongo node has crashed

If a node's CPU graph flatlines or truncates (a crash; Linode emails on a hard crash, not on a hang):

- Do not restart Mongo directly. Reboot the server instead (shutdown -r now); the node should rejoin the replica set as SECONDARY.
- At a stable moment, force the PRIMARY back to mongo-1 (the backed-up node).
- API instances obtain their DB connection at startup and do not reconnect automatically if a Mongo node drops, so after restoring Mongo, run pm2 restart API on each PM2-managed API node (api-1, api-2, api-4).
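To confirm the replica set state during the steps above, something like the following may help. It is a sketch only: it assumes the mongosh shell is installed (older installs may ship mongo instead), that the mongo-1 / mongo-2 / mongo-3 names resolve as hosts, and that any required authentication flags are added. The runbook does not say how the primary is moved back, so the step-down shown in the comments is one possible approach, not the documented procedure. The function names are hypothetical.

```shell
# Shell binary is an assumption; override with MONGO_SHELL=mongo if needed.
MONGO_SHELL="${MONGO_SHELL:-mongosh}"

# replset_states HOST: print "<member> <state>" for every replica set member,
# as reported by HOST.
replset_states() {
  $MONGO_SHELL --host "$1" --quiet --eval \
    'rs.status().members.forEach(function (m) { print(m.name + " " + m.stateStr); })'
}

# current_primary: read replset_states output on stdin and print
# the member whose state is PRIMARY.
current_primary() {
  awk '$2 == "PRIMARY" { print $1 }'
}

# Usage (run manually, one step at a time):
#   replset_states mongo-1                    # rebooted node should show SECONDARY
#   replset_states mongo-1 | current_primary  # which member is primary now?
#   # If the primary is not mongo-1, one option is to ask the current primary
#   # to step down for 60 seconds so that an election is held:
#   #   mongosh --host <current-primary> --eval 'rs.stepDown(60)'
```

A step-down only triggers an election. Whether mongo-1 wins depends on the replica set's configured member priorities, so check the result with replset_states afterward.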

Full database restoration

The prior runbook's "entire site unresponsive / database restoration" procedure was marked by the author as needing a rewrite ("do not follow these instructions"). Treat full-restore steps as unverified and confirm current backup/restore procedure before acting. Tracked in Gaps & Open Questions.

Deployment notes

API — PM2 deploy pulls origin/master and restarts the cluster. SSH into each API node one at a time when updating manually.
web2 — image build via GitHub Actions; the legacy git-deploy.php webhook ("Git Deployment Hamster") pulls code on push and has a deployment-status URL to confirm success.
admin panel — image build via GitHub Actions.