Monitoring Stack Gap Analysis - October 17, 2025

Executive Summary

A comprehensive review of the Grafana, Prometheus, and Loki monitoring stack found the core components functional, with 97.6% of Prometheus scrape targets healthy. The critical issues identified require both Kubernetes configuration changes and remediation of the external Ceph infrastructure.

Component Status

✅ Grafana (Healthy)

  • Status: Running (2/2 containers)
  • Memory: 441Mi
  • URL: grafana.chelonianlabs.com
  • Datasources: Properly configured
    • Prometheus: http://prometheus-operated.observability.svc.cluster.local:9090
    • Loki: http://loki-headless.observability.svc.cluster.local:3100
    • Alertmanager: http://alertmanager-operated.observability.svc.cluster.local:9093
  • Dashboards: 35+ configured and loading
  • Issues: None

✅ Prometheus (Healthy with Minor Issues)

  • Status: Running HA mode (2 replicas)
  • Memory: 2.1GB per pod
  • Scrape Success: 161/165 targets healthy (97.6%)
  • Storage: 5.8GB/100GB used (6%)
  • Retention: 14 days
  • Monitoring Coverage (counts reproducible with the commands after this section):
    • 38 ServiceMonitors
    • 7 PodMonitors
    • 44 PrometheusRules
  • Issues:
    • 4 targets down (2.4% failure rate)
    • Duplicate timestamp warnings from kube-state-metrics
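
The coverage counts above can be re-derived at any time from the prometheus-operator CRDs:

# Count monitoring objects cluster-wide:
kubectl get servicemonitors -A --no-headers | wc -l
kubectl get podmonitors -A --no-headers | wc -l
kubectl get prometheusrules -A --no-headers | wc -l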

⚠️ Loki (Functional but Dropping Logs)

  • Status: Running (2/2 containers)
  • Memory: 340Mi
  • Storage: 1.6GB/30GB used (5%)
  • Retention: 14 days
  • Log Collection: Successfully collecting from 17 namespaces
  • Issues:
    • CRITICAL: Max entry size limit (256KB) exceeded
    • Plex logs (553KB entries) being rejected
    • Error: Max entry size '262144' bytes exceeded

✅ Promtail (Healthy)

  • Status: DaemonSet running on all 11 nodes
  • Memory: 70-140Mi per pod
  • Target: http://loki-headless.observability.svc.cluster.local:3100/loki/api/v1/push
  • Issues: None on the Promtail side (logs ship successfully; oversized entries are rejected by Loki, not Promtail)

⚠️ Alertmanager (Healthy, with Active Alerts)

  • Status: Running (2/2 containers)
  • Memory: 37Mi
  • Active Alerts: 19 alerts firing
  • Issues: See Active Alerts section below
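
To enumerate what is firing, the Alertmanager v2 API can be queried directly; a minimal sketch using the service named in the Grafana datasource list (jq assumed available):

# Port-forward, list alert names with counts, then stop the forward:
kubectl -n observability port-forward svc/alertmanager-operated 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname' | sort | uniq -c
kill %1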

Critical Issues

1. Loki Log Entry Size Limit

Severity: High
Impact: Logs from high-volume applications are being dropped

Details:

  • Default max entry size: 262,144 bytes (256KB)
  • Plex application producing 553KB log entries
  • Logs silently dropped without alerting

Fix Applied:

  • ✅ Updated /kubernetes/apps/observability/loki/app/helmrelease.yaml
  • Added limits_config.max_line_size: 1048576 (1MB)
  • Action Required: Commit and push to trigger Flux reconciliation
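
Before pushing, the edit can be sanity-checked in the repo (path from the bullet above; run from the repo root):

# Confirm the new limit landed in the HelmRelease values:
grep -n -B2 "max_line_size" kubernetes/apps/observability/loki/app/helmrelease.yaml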

Verification:

# After deployment, verify no more errors:
kubectl logs -n observability -l app.kubernetes.io/name=promtail --tail=100 | grep "exceeded"
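
Loki also exposes a counter for discarded entries that can confirm the fix from the server side; a sketch (the metric is standard Loki, the exact reason label value is an assumption):

# Check the discard counter before and after the change:
kubectl -n observability port-forward svc/loki-headless 3100:3100 &
curl -s http://localhost:3100/metrics | grep loki_discarded_samples_total
kill %1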

2. External Ceph Cluster Health Warnings

Severity: High
Impact: PVC provisioning failures; pod scheduling blocked

Details: External Ceph cluster (running on Proxmox hosts) showing HEALTH_WARN:

  1. PG_AVAILABILITY (Critical):
    • 128 placement groups inactive
    • 128 placement groups incomplete
    • This is blocking new PVC creation
  2. MDS_SLOW_METADATA_IO:
    • 1 MDS (metadata server) reporting slow I/O
    • Impacts CephFS performance
  3. MDS_TRIM:
    • 1 MDS behind on trimming
    • Can impact metadata operations

Ceph Cluster Info:

  • FSID: 782dd297-215e-4c35-b7cf-659c20e6909e
  • Version: 18.2.7-0 (Reef)
  • Monitors: proxmox-02 (10.150.0.2), proxmox-03 (10.150.0.3), proxmox-04 (10.150.0.4)
  • Capacity: 195TB available / 244TB total (20% used)

Action Required: These are infrastructure-level issues that must be resolved on the Proxmox/Ceph cluster directly:

# SSH to Proxmox host and run:
ceph health detail
ceph pg dump | grep -E "inactive|incomplete"
ceph osd tree
ceph fs status cephfs

# Likely fixes (depending on root cause):
# - Check OSD status and bring up any down OSDs
# - Verify network connectivity between OSDs
# - Check disk space on OSD nodes
# - Review Ceph logs for specific PG issues
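
Once a stuck PG is identified, it can be queried directly for the condition blocking it; a short follow-up sketch (the PG id is a placeholder):

# List stuck PGs, then query one for its recovery/peering state:
ceph pg dump_stuck inactive
ceph pg <pgid> query | less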

Kubernetes Impact:

  • ❌ Gatus pod stuck in Pending (PVC provisioning failure)
  • ❌ VolSync destination pods failing
  • ❌ Any new workloads requiring CephFS storage blocked
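
The blast radius on the Kubernetes side can be confirmed as follows (the PVC name is a placeholder; event wording varies by CSI driver version):

# List unbound PVCs, then read the provisioning events for one of them:
kubectl get pvc -A | grep -v Bound
kubectl -n observability describe pvc <pvc-name> | grep -A10 Events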

3. Prometheus Scrape Target Failures

Down Targets (4 total):

  1. athena.manor:9221 - Unnamed exporter (likely SNMP)
  2. circe.manor:9221 - Unnamed exporter (likely SNMP)
  3. nut-upsd.kube-system.svc.cluster.local:3493 - NUT UPS exporter
  4. zigbee-controller-garage.manor - Zigbee controller

Analysis: All down targets are edge devices or external services. Core Kubernetes monitoring intact.

Recommended Actions:

  • Verify network connectivity to .manor hostnames (see the reachability sketch after this list)
  • Check if SNMP exporters are running
  • Investigate NUT UPS service in kube-system namespace
  • Verify zigbee-controller service status
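
A few starting-point checks for the list above (the busybox image is an assumption; /metrics is the usual exporter path):

# Probe an edge exporter over HTTP from inside the cluster:
kubectl run nettest --rm -it --image=busybox --restart=Never -- \
  wget -qO- -T 3 http://athena.manor:9221/metrics
# Check whether the NUT service actually has backing endpoints:
kubectl -n kube-system get svc,endpoints nut-upsd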

Active Alerts (19 Total)

High Priority:

  1. TargetDown - Related to 4 targets listed above
  2. KubePodNotReady - Related to Ceph PVC provisioning issues (gatus, volsync)
  3. KubeDeploymentRolloutStuck - Likely gatus deployment
  4. KubePersistentVolumeFillingUp - Identify the affected PVs (see the query after this list)
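
To identify the affected PVs directly, the kubelet volume metrics can be queried ad hoc (service name from the Grafana datasource list; the 85% threshold is illustrative):

kubectl -n observability port-forward svc/prometheus-operated 9090:9090 &
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85'
kill %1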

Medium Priority:

  1. CPUThrottlingHigh - Investigate which pods/namespaces
  2. KubeJobFailed - 2 failed jobs identified (inspection commands follow this list):
    • kometa-29344680 (media namespace)
    • plex-off-deck-29344620 (media namespace)
  3. VolSyncVolumeOutOfSync - Expected with current Ceph issues
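
The two failed Jobs can be inspected before cleanup (names from item 2 above):

# Show status and recent logs for the failed media Jobs:
kubectl -n media get job kometa-29344680 plex-off-deck-29344620
kubectl -n media logs job/kometa-29344680 --tail=50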

Informational:

  1. Watchdog - Always firing (heartbeat)
  2. PrometheusDuplicateTimestamps - kube-state-metrics timing issue (low impact)

Recommendations

Immediate Actions (Required before further work):

  1. ✅ Loki configuration updated - Ready for commit
  2. ⚠️ Fix Ceph PG issues - Must be done on Proxmox hosts
  3. ⚠️ Verify Ceph health - Run ceph health detail on Proxmox

Post-Ceph Fix:

  1. Delete stuck pods to retry provisioning:

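    # Pod name hashes change on reschedule; confirm the current names first:
    kubectl -n observability get pods | grep -E "gatus|volsync"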
    kubectl delete pod -n observability gatus-6fcfb64bc8-zz996
    kubectl delete pod -n observability volsync-dst-gatus-dst-8wvtx
    
  2. Investigate and fix down Prometheus targets:

    • Check SNMP exporter configurations
    • Verify NUT UPS service
    • Test network connectivity to .manor devices
  3. Review CPU throttling alerts:

    kubectl top pods -A --sort-by=cpu
    # Adjust resource limits as needed
    
  4. Clean up the failed CronJob runs in the media namespace:
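
    # Deleting the failed Jobs clears the KubeJobFailed alerts (names from the alerts section):
    kubectl -n media delete job kometa-29344680 plex-off-deck-29344620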

Long-term Improvements:

  • Add Loki ingestion metrics dashboard
  • Configure log sampling/filtering for high-volume apps
  • Set up PVC capacity monitoring alerts
  • Review and tune Prometheus scrape intervals
  • Consider adding CephFS-specific dashboards

Verification Checklist

After applying fixes:

  • Loki accepting large log entries (check Promtail logs)
  • No "exceeded" errors in Promtail logs
  • Ceph cluster shows HEALTH_OK
  • Gatus pod Running (2/2)
  • All PVCs Bound
  • Prometheus targets down count <= 2 (excluding optional edge devices)
  • Active alerts reduced to baseline (~5-10 expected)
  • All core namespace pods Running
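
A quick sweep covering most of the checklist (the Ceph check assumes SSH access to a monitor host such as proxmox-02):

# Size-limit errors on fresh entries (expect zero matches):
kubectl -n observability logs -l app.kubernetes.io/name=promtail --since=10m | grep -c exceeded
# Pod and PVC state:
kubectl -n observability get pods | grep -E "gatus|loki|prometheus"
kubectl get pvc -A | grep -v Bound
# Ceph health from a monitor node:
ssh proxmox-02 ceph health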

Infrastructure Context

Deployment Method:

  • GitOps: FluxCD
  • Workflow: Edit repo → User commits → User pushes → Flux reconciles
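
After the push, reconciliation can be nudged rather than waited on; a sketch (the Kustomization name is an assumption):

# Fetch the latest commit and reconcile the observability stack:
flux reconcile kustomization observability --with-source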

Storage:

  • Provider: External Ceph cluster (Proxmox)
  • Storage Classes: cephfs-shared (default), cephfs-static
  • Provisioner: rook-ceph.cephfs.csi.ceph.com
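
Both storage classes and their provisioner are visible from the cluster:

# Confirm the default class and provisioner:
kubectl get storageclass
kubectl get storageclass cephfs-shared -o jsonpath='{.provisioner}{"\n"}'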

Monitoring Namespace:

  • Namespace: observability
  • Components: Grafana, Prometheus (HA), Loki, Promtail, Alertmanager
  • Additional: VPA, Goldilocks, Gatus, Kromgo, various exporters

Next Steps

  1. User Action: Review and commit Loki configuration changes
  2. User Action: Fix Ceph PG availability issues on Proxmox
  3. After Ceph Fix: Proceed with pod cleanup and target investigations
  4. Monitor: Watch for new alerts or recurring issues

Generated: 2025-10-17
Analysis Duration: ~30 minutes
Status: Awaiting user commit and Ceph infrastructure remediation