Monitoring

This document describes which metrics to watch in production, the thresholds that should trigger alerts, and how to investigate common issues.

Key Metrics

Application

Metric	Normal Range	Alert Threshold	Action
HTTP 5xx error rate	< 0.1%	> 1% over 5 min	Check pod logs, recent deployments
P95 response latency	< 500 ms	> 2 s	Check DB query times, cache hit rate
Queued Hangfire jobs	< 50 queued	> 200 queued	Check worker health, inspect stuck jobs
Failed Hangfire jobs	0	> 5 in 1 hour	Investigate via dashboard
JWT auth failures	< 5/min	> 50/min	Possible brute-force attack; check rate limiting

Database

Metric	Normal Range	Alert Threshold	Action
Active connections	< 30	> 50	Check for connection leaks
Query P95 latency	< 100 ms	> 500 ms	Run `EXPLAIN ANALYZE` on slow queries
Replication lag	N/A (single instance)	> 60 s (if a replica is configured)	Check replica health
Dead tuples	Routine autovacuum	`pg_stat_user_tables.n_dead_tup` > 1M	Run `VACUUM ANALYZE`

Redis

Metric	Normal Range	Alert Threshold	Action
Cache hit rate	> 85%	< 60%	Investigate cache key expiry, Redis health
Used memory	< 80% of `maxmemory`	> 90%	Increase memory or reduce TTLs
Connected clients	< 30	> 80	Check for connection leaks

Infrastructure

Metric	Normal Range	Alert Threshold	Action
Pod CPU	< 60%	> 80% for 10 min	HPA should scale up; investigate if not
Pod memory	< 75%	> 90%	Possible memory leak; rolling restart
Disk (media PVC)	< 70%	> 85%	Archive old media or expand PVC
Node disk	< 70%	> 85%	Clean old images, expand node

Health Checks

The health endpoint is available at:

GET /health

The endpoint checks PostgreSQL (the primary database) and Redis. It returns `200 Healthy` if all checks pass and `503 Unhealthy` if any check fails.

Kubernetes liveness and readiness probes use this endpoint.
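
As a rough illustration of how an external script might interpret this endpoint, the following Python sketch treats only HTTP 200 as healthy. The function names and the idea of passing in a base URL are assumptions made for the example; they are not part of the service.

```python
import urllib.error
import urllib.request


def is_healthy(status_code: int) -> bool:
    """Treat only HTTP 200 as healthy, as described for GET /health."""
    return status_code == 200


def fetch_health_status(base_url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of GET <base_url>/health.

    urllib raises HTTPError for non-2xx responses such as 503, so the
    status code is read from the exception in that case. Connection
    failures (URLError, timeouts) are not caught and propagate to the caller.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

A caller would combine the two, for example `is_healthy(fetch_health_status(base_url))`, where `base_url` is whatever address the service is reachable at in your environment.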

Investigating Issues

High Error Rate

# View recent errors
kubectl logs -l app=truload-backend -n truload --tail=200 | grep '"level":"Error"'

# Check if a bad deployment triggered it
kubectl rollout history deployment/truload-backend -n truload

Slow Queries

-- Find queries running > 5 seconds
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
  AND state = 'active';

Redis Cache Miss Spike

kubectl exec -it redis-<pod-id> -n truload -- redis-cli INFO stats \
  | grep -E "keyspace_hits|keyspace_misses"
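
The two counters printed by this command are cumulative, so a ratio taken from a single reading reflects the lifetime average rather than a recent spike. The sketch below, in Python, shows one way to turn two readings of the `INFO stats` output into a hit rate for the interval between them, which can then be compared against the 85% and 60% figures in the Redis table. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_keyspace_counters(info_stats: str) -> Tuple[int, int]:
    """Extract (keyspace_hits, keyspace_misses) from `INFO stats` text."""
    counters = {"keyspace_hits": 0, "keyspace_misses": 0}
    for line in info_stats.splitlines():
        # Each stats line has the form "name:value".
        key, sep, value = line.strip().partition(":")
        if sep and key in counters:
            counters[key] = int(value)
    return counters["keyspace_hits"], counters["keyspace_misses"]


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def hit_rate_between(earlier: str, later: str) -> Optional[float]:
    """Hit rate for the interval between two INFO snapshots.

    Subtracting the earlier counters from the later ones isolates the
    lookups that happened between the two readings.
    """
    h0, m0 = parse_keyspace_counters(earlier)
    h1, m1 = parse_keyspace_counters(later)
    return hit_rate(h1 - h0, m1 - m0)
```

If the counters were reset between the two readings (for example by a Redis restart), the differences become negative and the result is meaningless, so take both readings from the same uninterrupted instance.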

Hangfire Queue Depth

Open `/hangfire` in a browser. If the default queue is backed up:

1. Check under Servers that all workers are alive.
2. Look under Processing for recurring jobs that are taking longer than expected.
3. Check for any failed job that is blocking enqueued work.

Alerts (Recommended)

Configure these alerts in your monitoring system (Prometheus/Grafana or similar):

Alert Name	Condition	Severity
`HighErrorRate`	HTTP 5xx > 1% for 5 min	Critical
`SlowResponses`	P95 latency > 2 s for 5 min	Warning
`HangfireQueueDepth`	Queued jobs > 200	Warning
`HangfireFailedJobs`	Failed jobs > 5 in 1 h	Error
`DatabaseConnectionHigh`	Active connections > 50	Warning
`PodCpuHigh`	CPU > 80% for 10 min	Warning
`PodMemoryHigh`	Memory > 90%	Critical
`RedisMemoryHigh`	Redis memory > 90%	Warning
`DiskSpaceHigh`	Disk > 85%	Warning
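
Conditions written as "for 5 min" or "for 10 min" are meant to fire only when the threshold is exceeded for the whole window, not on a single bad sample. The Python sketch below illustrates that semantics for the `HighErrorRate` condition. It is only an illustration under assumed names and an assumed sample format (timestamp in seconds, value); in a Prometheus setup this behaviour would normally come from the alerting rule's `for` clause rather than from custom code.

```python
from typing import List, Tuple


def error_rate_percent(total_requests: int, server_errors: int) -> float:
    """Share of requests that returned HTTP 5xx, as a percentage."""
    if total_requests == 0:
        return 0.0
    return 100.0 * server_errors / total_requests


def breached_for(
    samples: List[Tuple[float, float]],
    threshold: float,
    window_seconds: float,
) -> bool:
    """True if the threshold was exceeded throughout the trailing window.

    samples: (timestamp_seconds, value) pairs sorted by ascending timestamp.
    The window ends at the newest sample. The check fails if the samples do
    not reach back to the start of the window, or if any sample inside the
    window is at or below the threshold.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    if samples[0][0] > window_start:
        return False  # not enough history to cover the window
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)
```

For example, one sample per minute over five minutes, each with an error rate above 1%, satisfies `breached_for(samples, 1.0, 300)`, while a single sample at or below 1% inside the window does not.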