Runbooks

Automate operational responses with safety guardrails and approval gates for high-impact actions.

Triggers

Trigger     Fires when
Alert       A specific alert fires
Anomaly     AIOps detects an anomaly
Incident    An incident is created
Schedule    A cron expression matches
Webhook     An external POST is received
Manual      Run manually from the UI

Actions

Action      What it does
Scale       Scale deployment replicas
Restart     Rolling restart of pods
Notify      Send a notification to a channel
Exec        Run a command in a pod
Patch       Apply a patch to a K8s resource
Drain       Drain a node
Cordon      Mark a node unschedulable
Custom      Run a custom script
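
To make the relationship between triggers and actions concrete, here is a minimal Python sketch of how a runbook could be modeled as one trigger plus an ordered list of action steps. The class names (Trigger, Action, Step, Runbook), the params dictionary, and the example runbook are illustrative assumptions, not the product's actual schema or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Trigger(Enum):
    """The six trigger types listed above."""
    ALERT = "alert"
    ANOMALY = "anomaly"
    INCIDENT = "incident"
    SCHEDULE = "schedule"
    WEBHOOK = "webhook"
    MANUAL = "manual"


class Action(Enum):
    """The eight action types listed above."""
    SCALE = "scale"
    RESTART = "restart"
    NOTIFY = "notify"
    EXEC = "exec"
    PATCH = "patch"
    DRAIN = "drain"
    CORDON = "cordon"
    CUSTOM = "custom"


@dataclass
class Step:
    action: Action
    # Hypothetical free-form parameters; the real field names are not documented here.
    params: dict = field(default_factory=dict)


@dataclass
class Runbook:
    name: str
    trigger: Trigger
    steps: list[Step]

    def describe(self) -> str:
        """Return a one-line summary of what the runbook does."""
        chain = " -> ".join(step.action.value for step in self.steps)
        return f"{self.name}: on {self.trigger.value} run {chain}"


# Hypothetical example: notify a channel, then restart a deployment when an alert fires.
rb = Runbook(
    name="restart-on-alert",
    trigger=Trigger.ALERT,
    steps=[
        Step(Action.NOTIFY, {"channel": "#ops"}),
        Step(Action.RESTART, {"deployment": "api"}),
    ],
)
print(rb.describe())  # restart-on-alert: on alert run notify -> restart
```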

Safety guardrails & approval gates

  • Max executions per hour (default: 5) and cooldown between runs.
  • Scope limited to specific namespaces or clusters.
  • Dry-run mode to verify without applying changes.
  • High-impact actions (drain, scale >3x, exec in production) pause for admin/owner approval.
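
The guardrails above can be read as a sequence of checks that runs before each execution. The following Python sketch shows one possible ordering: scope, rate limit, cooldown, approval gate, then dry run. Everything except the default of 5 executions per hour is an assumption made for illustration: the 300-second cooldown, the namespace names, treating "scale >3x" as a target-to-current replica ratio above 3, identifying production by a namespace called "production", and the function names themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Guardrails:
    max_per_hour: int = 5                 # default stated in the text
    cooldown_s: float = 300.0             # assumed value; the text gives no default
    allowed_namespaces: frozenset = frozenset({"staging"})  # assumed scope
    dry_run: bool = False
    history: list = field(default_factory=list)  # timestamps (seconds) of real runs


def needs_approval(action: str, namespace: str, scale_factor: float = 1.0) -> bool:
    """High-impact actions pause for admin/owner approval."""
    if action == "drain":
        return True
    if action == "scale" and scale_factor > 3:   # assumed meaning of "scale >3x"
        return True
    if action == "exec" and namespace == "production":  # assumed namespace name
        return True
    return False


def decide(g: Guardrails, action: str, namespace: str, now: float,
           scale_factor: float = 1.0) -> str:
    """Return what should happen to a requested action at time `now`."""
    if namespace not in g.allowed_namespaces:
        return "blocked: out of scope"
    recent = [t for t in g.history if now - t < 3600]
    if len(recent) >= g.max_per_hour:
        return "blocked: rate limit"
    if recent and now - max(recent) < g.cooldown_s:
        return "blocked: cooldown"
    if needs_approval(action, namespace, scale_factor):
        return "pending approval"
    if g.dry_run:
        return "dry run"          # verify only; not counted as an execution
    g.history.append(now)
    return "execute"


g = Guardrails(allowed_namespaces=frozenset({"staging", "production"}))
print(decide(g, "restart", "staging", now=0.0))      # execute
print(decide(g, "restart", "staging", now=10.0))     # blocked: cooldown
print(decide(g, "exec", "production", now=1000.0))   # pending approval
```

In this sketch, dry runs and approval-pending requests do not count toward the hourly limit; whether the real system counts them is not stated in the text.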

Execution history

A timeline of every execution with status (success, failed, pending approval, cancelled), detailed step logs, duration, and affected resources.
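
As a rough illustration of what one entry in that timeline might hold, the Python sketch below models an execution record with the fields named above. The class and field names, the use of epoch seconds, and the newest-first ordering are assumptions, not the product's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    """The four execution statuses named in the text."""
    SUCCESS = "success"
    FAILED = "failed"
    PENDING_APPROVAL = "pending approval"
    CANCELLED = "cancelled"


@dataclass
class Execution:
    runbook: str
    status: Status
    started_at: float                      # epoch seconds (assumed representation)
    finished_at: Optional[float] = None    # None while still pending
    step_logs: list = field(default_factory=list)
    affected_resources: list = field(default_factory=list)

    @property
    def duration_s(self) -> Optional[float]:
        """Elapsed seconds, or None if the execution has not finished."""
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at


def timeline(executions):
    """Order executions newest first (an assumed display order)."""
    return sorted(executions, key=lambda e: e.started_at, reverse=True)


# Hypothetical records for illustration only.
runs = [
    Execution("restart-on-alert", Status.SUCCESS, 100.0, 142.5,
              ["notify: sent", "restart: rolled 3 pods"], ["deployment/api"]),
    Execution("drain-node", Status.PENDING_APPROVAL, 200.0),
]
for e in timeline(runs):
    print(e.runbook, e.status.value, e.duration_s)
```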