VPA-Based Resource Limit Updates
Summary
This document outlines a plan to systematically update resource limits across the cluster based on VPA (Vertical Pod Autoscaler) recommendations from Goldilocks to eliminate CPU throttling alerts.
Changes Already Made
1. Alert Configuration
File: kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml
- Changed default receiver from pushover to "null"
- Added explicit routes for severity: warning and severity: critical to pushover
- Result: Only critical and warning alerts will trigger pushover notifications (no more info-level spam)
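For reference, the routing change maps onto the AlertmanagerConfig CRD roughly as in the minimal sketch below (a fragment of spec; the pushover and "null" receiver names are assumed to already be defined in this file):

route:
  receiver: "null"              # default: swallow anything not matched below
  routes:
    - receiver: pushover
      matchers:
        - name: severity
          value: critical
          matchType: "="
    - receiver: pushover
      matchers:
        - name: severity
          value: warning
          matchType: "="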
2. Promtail Resources
File: kubernetes/apps/observability/promtail/app/helmrelease.yaml
- CPU Request: 50m → 100m
- CPU Limit: 100m → 250m
- Rationale: VPA recommends 101m upper bound, but we added headroom for log bursts
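In the promtail HelmRelease, that change amounts to a values fragment along these lines (a sketch assuming the chart's top-level resources value; memory settings are left to whatever the release already defines):

values:
  resources:
    requests:
      cpu: 100m     # was 50m
    limits:
      cpu: 250m     # was 100m; VPA upper bound is ~101m, extra headroom for log bursts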
Priority Workloads for Update
High Priority (Currently Throttling or at Risk)
Observability Namespace
- Loki - Log aggregation
  - Current: cpu: 35m request, 200m limit
  - VPA: cpu: 23m request, 140m limit
  - Action: Keep current limits (already adequate)
- Grafana - Visualization
  - Current: No CPU limits
  - VPA: cpu: 63m request, 213m limit
  - Action: Add limits - 100m request, 500m limit for burst capacity
- Internal Nginx Ingress (network namespace)
  - Current: cpu: 500m request, no limit
  - VPA: cpu: 63m request, 316m limit
  - Action: Add 500m limit (keep generous for traffic spikes)
Medium Priority (Good to standardize)
Observability Namespace
- kube-state-metrics
  - VPA: cpu: 23m request, 77m limit
  - Action: Add resources block
- Goldilocks Controller
  - VPA: cpu: 587m request, 2268m limit (!)
  - Action: Add generous limits for this workload
- Blackbox Exporter
  - VPA: cpu: 15m request, 37m limit
  - Action: Add resources block
Network Namespace
- External Nginx Ingress
  - VPA: cpu: 49m request, 165m limit
  - Action: Add resources block
- Cloudflared
  - VPA: cpu: 15m request, 214m limit
  - Action: Add resources block (note the high burst)
Low Priority (Already well-configured)
- Node Exporter: Current limits are generous (250m limit vs 22m VPA)
- DCGM Exporter: Has limits, VPA shows adequate
- Media workloads: Most have no CPU limits (intentional for high CPU apps like Plex, Bazarr)
Implementation Strategy
Phase 1: Stop the Alerts (DONE ✅)
- Update alertmanagerconfig to filter by severity
- Update promtail CPU limits
Phase 2: Observability Namespace (Next)
Update these critical monitoring components:
- Grafana - Add CPU limits
- kube-state-metrics - Add resources
- Goldilocks controller - Add resources
- Blackbox exporter - Add resources
Phase 3: Network Infrastructure
- Internal nginx ingress - Add CPU limit
- External nginx ingress - Add resources
- Cloudflared - Add resources
Phase 4: Optional Refinements
- Review VPA recommendations quarterly
- Adjust limits based on actual usage patterns
- Consider enabling VPA auto-mode for non-critical workloads
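By default, Goldilocks creates its VPA objects in recommendation-only mode, so enabling auto-mode for a workload would mean a VPA with updateMode set to Auto, roughly like the sketch below (names are hypothetical; in Auto mode the VPA updater may evict pods to apply new requests, so reserve it for workloads that tolerate restarts):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-workload        # hypothetical
  namespace: observability
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-workload      # hypothetical target
  updatePolicy:
    updateMode: "Auto"          # VPA applies recommendations by recreating pods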
How to Use VPA Recommendations
1. View All Recommendations
# Run the helper script
./scripts/vpa-resource-recommendations.sh
# Or visit the dashboard
open https://goldilocks.chelonianlabs.com
2. Get Specific Workload Recommendations
kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{.status.recommendation.containerRecommendations[0]}' | jq
3. Update HelmRelease
Add a resources block under the chart values:
values:
  resources:
    requests:
      cpu: <vpa_target>
      memory: <vpa_target_memory>
    limits:
      cpu: <vpa_upper_or_2x_for_bursts>
      memory: <vpa_upper_memory>
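For example, filled in with the Grafana numbers above (VPA target 63m rounded up to a 100m request, and a 500m limit for roughly 2x the 213m upper bound; memory placeholders are left as-is since this plan only tracks CPU):

values:
  resources:
    requests:
      cpu: 100m                 # VPA target is 63m, rounded up
      memory: <vpa_target_memory>
    limits:
      cpu: 500m                 # VPA upper bound is 213m; ~2x for burst capacity
      memory: <vpa_upper_memory>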
4. Apply and Monitor
# Commit changes
git add kubernetes/apps/observability/grafana/app/helmrelease.yaml
git commit -m "feat(grafana): add CPU limits based on VPA recommendations"
git push
# Force reconciliation (optional)
flux reconcile helmrelease -n observability grafana
# Monitor for throttling
kubectl top pods -n observability --containers
VPA Interpretation Guide
VPA Recommendation Fields:
- target: Use as your request value
- lowerBound: Minimum to function
- upperBound: Use as limit (or higher for burst workloads)
- uncappedTarget: What VPA thinks is ideal without constraints
When to Deviate:
- Burst workloads (logs, ingress): Use 2-3x upper bound for limits
- Background jobs: Match VPA recommendations closely
- User-facing apps: Add 50-100% headroom for traffic spikes
- Resource-constrained: Start with target, monitor, then adjust
Monitoring for Success
After updates, verify alerts have stopped:
# Check for firing CPU throttling alerts (amtool ships in the Alertmanager container;
# adjust the pod name to match your kube-prometheus-stack release)
kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager -- \
  amtool alert query --alertmanager.url=http://localhost:9093 | grep -i throttl
# Check actual CPU usage vs limits
kubectl top pods -A --containers | sort -k4 -h -r | head -20
# Review VPA over time
watch kubectl get vpa -n observability
Tools Created
- scripts/vpa-resource-recommendations.sh - Extract VPA recommendations with HelmRelease file locations
- This document - Implementation plan and guidance