Dapper Cluster Documentation
This documentation covers the architecture, configuration, and operations of the Dapper Kubernetes cluster, a high-performance home lab infrastructure with GPU capabilities.
Cluster Overview
```mermaid
graph TD
  subgraph ControlPlane["Control Plane"]
    CP1["Control Plane 1<br>4 CPU, 16GB"]
    CP2["Control Plane 2<br>4 CPU, 16GB"]
    CP3["Control Plane 3<br>4 CPU, 16GB"]
  end
  subgraph Workers["Worker Nodes"]
    W1["Worker 1<br>16 CPU, 128GB"]
    W2["Worker 2<br>16 CPU, 128GB"]
    GPU["GPU Node<br>16 CPU, 128GB<br>4x Tesla P100"]
  end
  CP1 --- CP2
  CP2 --- CP3
  CP3 --- CP1
  ControlPlane --> Workers
```
Hardware Specifications
Control Plane
- 3 nodes for high availability
- 4 CPU cores per node
- 16GB RAM per node
- Dedicated to cluster control plane operations
Worker Nodes
- 2 general-purpose worker nodes
- 16 CPU cores per node
- 128GB RAM per node
- Handles general workloads and applications
GPU Node
- Specialized GPU worker node
- 16 CPU cores
- 128GB RAM
- 4x NVIDIA Tesla P100 GPUs
- Handles ML/AI and GPU-accelerated workloads
Key Features
- High-availability Kubernetes cluster
- GPU acceleration support
- Automated deployment using Flux CD
- Secure secrets management with SOPS
- NFS and OpenEBS storage integration
- Comprehensive monitoring and observability
- Media services automation
Infrastructure Components
```mermaid
graph TD
  subgraph CoreServices["Core Services"]
    Flux[Flux CD]
    Storage[Storage Layer]
    Network[Network Layer]
  end
  subgraph Applications
    Media[Media Stack]
    Monitor[Monitoring]
    GPU[GPU Workloads]
  end
  CoreServices --> Applications
  Storage -->|NFS/OpenEBS| Applications
  Network -->|Ingress/DNS| Applications
```
Documentation Structure
- Architecture: Detailed technical documentation about cluster design and components
  - High-availability control plane design
  - Storage architecture and configuration
  - Network topology and policies
  - GPU integration and management
- Applications: Information about deployed applications and their configurations
  - Media services stack
  - Monitoring and observability
  - GPU-accelerated applications
- Operations: Guides for installation, maintenance, and troubleshooting
  - Cluster setup procedures
  - Node management
  - GPU configuration
  - Maintenance tasks
Getting Started
For new users, we recommend starting with:
- Architecture Overview - Understanding the cluster design
- Installation Guide - Setting up the cluster
- Application Stack - Deploying applications
Quick Links
Architecture Overview
Cluster Architecture
```mermaid
graph TD
  subgraph ControlPlane["Control Plane"]
    CP1["Control Plane 1<br>4 CPU, 16GB"]
    CP2["Control Plane 2<br>4 CPU, 16GB"]
    CP3["Control Plane 3<br>4 CPU, 16GB"]
    CP1 --- CP2
    CP2 --- CP3
    CP3 --- CP1
  end
  subgraph Workers["Worker Nodes"]
    W1["Worker 1<br>16 CPU, 128GB"]
    W2["Worker 2<br>16 CPU, 128GB"]
  end
  subgraph GPUNode["GPU Node"]
    GPU["GPU Worker<br>16 CPU, 128GB<br>4x Tesla P100"]
  end
  ControlPlane --> Workers
  ControlPlane --> GPU
```
Core Components
Control Plane
- High Availability: 3-node control plane configuration
- Resource Allocation: 4 CPU, 16GB RAM per node
- Components:
- etcd cluster
- API Server
- Controller Manager
- Scheduler
Worker Nodes
- General Purpose Workers: 2 nodes
- Resources per Node:
- 16 CPU cores
- 128GB RAM
- Workload Types:
- Application deployments
- Database workloads
- Media services
- Monitoring systems
GPU Node
- Specialized Worker: 1 node
- Hardware:
- 16 CPU cores
- 128GB RAM
- 4x NVIDIA Tesla P100 GPUs
- Workload Types:
- ML/AI workloads
- Video transcoding
- GPU-accelerated applications
Network Architecture
```mermaid
graph TD
  subgraph External
    Internet((Internet))
    DNS((DNS))
  end
  subgraph Edge["Network Edge"]
    FW[Firewall]
    LB[Load Balancer]
  end
  subgraph K8sNet["Kubernetes Network"]
    CP[Control Plane]
    Workers[Worker Nodes]
    GPUNode[GPU Node]
    subgraph Services
      Ingress[Ingress Controller]
      CoreDNS[CoreDNS]
      CNI[Network Plugin]
    end
  end
  Internet --> FW
  DNS --> FW
  FW --> LB
  LB --> CP
  CP --> Workers
  CP --> GPUNode
  Services --> Workers
  Services --> GPUNode
```
Storage Architecture
```mermaid
graph TD
  subgraph StorageClasses["Storage Classes"]
    NFS[NFS Storage Class]
    OpenEBS[OpenEBS Storage Class]
  end
  subgraph PVs["Persistent Volumes"]
    NFS --> NFS_PV[NFS PVs]
    OpenEBS --> Local_PV[Local PVs]
  end
  subgraph Workloads["Workload Types"]
    NFS_PV --> Media[Media Storage]
    NFS_PV --> Shared[Shared Config]
    Local_PV --> DB[Databases]
    Local_PV --> Cache[Cache Storage]
  end
```
Security Considerations
- Network segmentation using Kubernetes network policies
- Encrypted secrets management with SOPS
- TLS encryption for all external services
- Regular security updates via automated pipelines
- GPU access controls and resource quotas
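As an illustration of the last point, GPU consumption can be capped per namespace with a ResourceQuota. This is a minimal sketch only; the namespace name is a placeholder, not a value from this repository.

```yaml
# Hypothetical quota limiting GPU requests in one namespace (names are illustrative)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads   # placeholder namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # at most 2 of the 4 Tesla P100s
    limits.nvidia.com/gpu: "2"
```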
Scalability
The cluster architecture is designed to be scalable:
- High-availability control plane (3 nodes)
- Expandable worker node pool
- Specialized GPU node for compute-intensive tasks
- Dynamic storage provisioning
- Load balancing for external services
- Resource quotas and limits management
Monitoring and Observability
```mermaid
graph LR
  subgraph Monitoring["Monitoring Stack"]
    Prom[Prometheus]
    Graf[Grafana]
    Alert[Alertmanager]
  end
  subgraph Nodes["Node Types"]
    CP[Control Plane Metrics]
    Work[Worker Metrics]
    GPU[GPU Metrics]
  end
  CP --> Prom
  Work --> Prom
  GPU --> Prom
  Prom --> Graf
  Prom --> Alert
```
Resource Management
Control Plane
- Reserved for Kubernetes control plane components
- Optimized for control plane operations
- High availability configuration
Worker Nodes
- General purpose workloads
- Balanced resource allocation
- Flexible scheduling options
GPU Node
- Dedicated for GPU workloads
- NVIDIA GPU operator integration
- Specialized resource scheduling
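For example, a workload lands on the GPU node by requesting the extended resource exposed by the NVIDIA GPU Operator's device plugin. A minimal sketch follows; the pod name and image tag are illustrative.

```yaml
# Hypothetical GPU smoke test scheduled via the nvidia.com/gpu resource
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one of the four Tesla P100s
```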
Network Architecture
This document covers the Kubernetes application-level networking. For physical network topology and VLAN configuration, see Network Topology.
Container Networking (CNI)
Cilium CNI
The cluster uses Cilium as the primary Container Network Interface (CNI):
- Pod CIDR: 10.69.0.0/16 (native routing mode)
- Service CIDR: 10.96.0.0/16
- Mode: Non-exclusive (paired with Multus for multi-network support)
- Kube-Proxy Replacement: Enabled (eBPF-based service load balancing)
- Load Balancing Algorithm: Maglev with DSR (Direct Server Return)
- Network Policy: Endpoint routes enabled
- BPF Masquerading: Enabled for outbound traffic
Key Features:
- High-performance eBPF data plane
- Native Kubernetes network policy support
- L2 announcements for external load balancer IPs
- Advanced observability and monitoring
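The settings above correspond roughly to Helm values like the following. This is a sketch only; key names vary slightly between Cilium chart versions, so treat it as illustrative rather than the cluster's actual values file.

```yaml
# Illustrative Cilium Helm values matching the description above
ipam:
  mode: kubernetes
routingMode: native
ipv4NativeRoutingCIDR: 10.69.0.0/16   # pod CIDR, native routing mode
kubeProxyReplacement: true            # eBPF service load balancing
loadBalancer:
  algorithm: maglev
  mode: dsr                           # Direct Server Return
bpf:
  masquerade: true
endpointRoutes:
  enabled: true
l2announcements:
  enabled: true                       # announce LoadBalancer IPs via ARP
cni:
  exclusive: false                    # allow Multus alongside Cilium
```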
Multus CNI (Multiple Networks)
Multus provides additional network interfaces to pods beyond the primary Cilium network:
- Primary Use: IoT network attachment (VLAN-based isolation)
- Network Attachment: macvlan on ens19 interface
- Mode: Bridge mode with DHCP IPAM
- Purpose: Enable pods to connect to additional networks (e.g., IoT devices, legacy systems)
Pods can request additional networks via annotations:
metadata:
annotations:
k8s.v1.cni.cncf.io/networks: macvlan-conf
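The macvlan-conf attachment referenced above is defined by a NetworkAttachmentDefinition. A minimal sketch matching the description (macvlan on ens19, bridge mode, DHCP IPAM); the exact manifest in the repo may differ:

```yaml
# Illustrative NetworkAttachmentDefinition for the IoT network attachment
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens19",
      "mode": "bridge",
      "ipam": { "type": "dhcp" }
    }
```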
Ingress Controllers
The cluster uses dual ingress-nginx controllers for traffic routing:
Internal Ingress
- Class: internal (default)
- Purpose: Internal services, private DNS
- Version: v4.13.3
- Load Balancer: Cilium L2 announcement
- DNS: Synced to internal DNS via k8s-gateway and External-DNS (UniFi webhook)
External Ingress
- Class: external
- Purpose: Public-facing services
- Version: v4.13.3
- Load Balancer: Cilium L2 announcement
- DNS: Synced to Cloudflare via External-DNS
- Tunnel: Cloudflared for secure access
Load Balancer IP Management
Cilium L2 Announcements
Cilium's L2 announcement feature provides load balancer IPs for services:
- How it works: Cilium announces load balancer IPs via L2 (ARP/NDP)
- Policy-based: L2AnnouncementPolicy defines which services get announced
- Benefits:
- No external load balancer required
- Native Kubernetes LoadBalancer service type support
- High availability through leader election
- Automatic failover
Configuration: See kubernetes/apps/kube-system/cilium/config/l2.yaml
This enables both ingress controllers to receive external IPs that are accessible from the broader network.
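For reference, a Cilium L2 announcement setup typically pairs a load balancer IP pool with an announcement policy. The sketch below is illustrative only; the CIDR, selectors, and interface pattern are placeholders, and the real definitions live in the l2.yaml file referenced above.

```yaml
# Illustrative only - actual values live in kubernetes/apps/kube-system/cilium/config/l2.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  blocks:
    - cidr: 192.168.1.224/28   # placeholder LoadBalancer IP range
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-policy
spec:
  loadBalancerIPs: true
  interfaces:
    - ^eth[0-9]+               # placeholder interface pattern
  nodeSelector:
    matchLabels:
      kubernetes.io/os: linux
```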
Network Policies
```mermaid
graph LR
  subgraph Policies
    Default[Default Deny]
    Allow[Allowed Routes]
  end
  subgraph Apps
    Media[Media Stack]
    Monitor[Monitoring]
    DB[Databases]
  end
  Allow --> Media
  Allow --> Monitor
  Default --> DB
```
DNS Configuration
Internal DNS (k8s-gateway)
- Purpose: DNS server for internal ingresses
- Domain: Internal cluster services
- Integration: Works with External-DNS for automatic record creation
External-DNS (Dual Instances)
Instance 1: Internal DNS
- Provider: UniFi (via webhook provider)
- Target: UDM Pro Max
- Ingress Class: internal
- Purpose: Sync private DNS records for internal services
Instance 2: External DNS
- Provider: Cloudflare
- Ingress Class: external
- Purpose: Sync public DNS records for externally accessible services
How DNS Works
- Create an Ingress with class internal or external
- External-DNS watches for new/updated ingresses
- Appropriate External-DNS instance syncs DNS records to target provider
- Services become accessible via their configured hostnames
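As a concrete illustration, an internal service only needs the right ingress class for its DNS record to be created automatically. A minimal sketch; the hostname and service name are placeholders.

```yaml
# Hypothetical internal ingress; External-DNS (UniFi webhook) creates the matching record
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: default
spec:
  ingressClassName: internal   # use "external" to publish via Cloudflare instead
  rules:
    - host: my-app.example.internal   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```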
Security
Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
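With a default-deny baseline in place, each application then declares the traffic it actually needs. A hedged example that allows ingress only from the namespace running the ingress controllers; the labels, namespace name, and port are placeholders.

```yaml
# Illustrative allow rule layered on top of default-deny
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
spec:
  podSelector:
    matchLabels:
      app: my-app                                    # placeholder app label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # placeholder namespace
      ports:
        - protocol: TCP
          port: 8080
```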
TLS Configuration
- Automatic certificate management via cert-manager
- Let's Encrypt integration
- Internal PKI for service mesh
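In practice this means an ingress only needs a cert-manager annotation and a tls block to receive a Let's Encrypt certificate. A minimal sketch; the issuer name and hostname are placeholders.

```yaml
# Illustrative TLS-enabled ingress using cert-manager
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production   # placeholder issuer name
spec:
  ingressClassName: external
  tls:
    - hosts:
        - my-app.example.com   # placeholder hostname
      secretName: my-app-tls
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```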
Service Mesh
Traffic Flow
```mermaid
graph LR
  subgraph Ingress
    External[External Traffic]
    Traefik[Traefik]
  end
  subgraph Services
    App1[Service 1]
    App2[Service 2]
    DB[Database]
  end
  External --> Traefik
  Traefik --> App1
  Traefik --> App2
  App1 --> DB
  App2 --> DB
```
Best Practices
- Security
  - Implement default deny policies
  - Use TLS everywhere
  - Regular security audits
  - Network segmentation
- Performance
  - Load balancer optimization
  - Connection pooling
  - Proper resource allocation
  - Traffic monitoring
- Reliability
  - High availability configuration
  - Failover planning
  - Backup routes
  - Health checks
- Monitoring
  - Network metrics collection
  - Traffic analysis
  - Latency monitoring
  - Bandwidth usage tracking
Troubleshooting
Common network issues and resolution steps:
- Connectivity Issues
  - Check network policies
  - Verify DNS resolution
  - Inspect service endpoints
  - Review ingress configuration
- Performance Problems
  - Monitor network metrics
  - Check for bottlenecks
  - Analyze traffic patterns
  - Review resource allocation
Network Topology
Overview
The Dapper Cluster network spans two physical locations (garage/shop and house) connected via a 1Gbps wireless bridge. The network uses a dual-switch design in the garage with high-speed interconnects for server and storage traffic.
Network Locations
Garage/Shop (Server Room)
- Primary compute infrastructure
- Core and distribution switches
- 4x Proxmox hosts running Talos/Kubernetes
- High-speed storage network
House
- Access layer switch
- OPNsense router/firewall
- Client devices
- Connected to garage via 60GHz wireless bridge (1Gbps)
Device Inventory
Core Network Equipment
Device | Model | Management IP | Location | Role | Notes |
---|---|---|---|---|---|
Brocade Core | ICX6610 | 192.168.1.20 | Garage | Core/L3 Switch | Manages VLAN 150, 200 routing |
Arista Distribution | 7050 | 192.168.1.21 | Garage | Distribution Switch | High-speed 40Gb interconnects |
Aruba Access | S2500-48p | 192.168.1.26 | House | Access Switch | PoE, client devices |
OPNsense Router | i3-4130T - 16GB | 192.168.1.1 | House | Router/Firewall | Manages VLAN 1, 100 routing |
Mikrotik Radio (House) | NRay60 | 192.168.1.7 | House | Wireless Bridge | 1Gbps to garage |
Mikrotik Radio (Shop) | NRay60 | 192.168.1.8 | Garage | Wireless Bridge | 1Gbps to house |
Mikrotik Switch | CSS326-24G-2S | 192.168.1.27 | Garage | Wireless Bridge - Brocade Core | Always up interconnect |
Compute Infrastructure
Device | Management IP | IPMI IP | Location | Links | Notes |
---|---|---|---|---|---|
Proxmox Host 1 | 192.168.1.62 | 192.168.1.162 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
Proxmox Host 2 | 192.168.1.63 | 192.168.1.165 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
Proxmox Host 3 | 192.168.1.64 | 192.168.1.163 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
Proxmox Host 4 | 192.168.1.66 | 192.168.1.164 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
Kubernetes Nodes (VMs on Proxmox)
Hostname | Primary IP (VLAN 100) | Storage IP (VLAN 150) | Role | Host |
---|---|---|---|---|
talos-control-1 | 10.100.0.50 | 10.150.0.10 | Control Plane | Proxmox-03 |
talos-control-2 | 10.100.0.51 | 10.150.0.11 | Control Plane | Proxmox-04 |
talos-control-3 | 10.100.0.52 | 10.150.0.12 | Control Plane | Proxmox-02 |
talos-node-gpu-1 | 10.100.0.53 | 10.150.0.13 | Worker (GPU) | Proxmox-03 |
talos-node-large-1 | 10.100.0.54 | 10.150.0.14 | Worker | Proxmox-03 |
talos-node-large-2 | 10.100.0.55 | 10.150.0.15 | Worker | Proxmox-03 |
talos-node-large-3 | 10.100.0.56 | 10.150.0.16 | Worker | Proxmox-03 |
Kubernetes Cluster VIP: 10.100.0.40 (shared across control plane nodes)
VLAN Configuration
VLAN ID | Network | Subnet | Gateway | MTU | Purpose | Gateway Device | Notes |
---|---|---|---|---|---|---|---|
1 | LAN | 192.168.1.0/24 | 192.168.1.1 | 1500 | Management, clients | OPNsense | Default VLAN |
100 | SERVERS | 10.100.0.0/24 | 10.100.0.1 | 1500 | Kubernetes nodes, VMs | OPNsense | Primary server network |
150 | CEPH-PUBLIC | 10.150.0.0/24 | None (internal) | 9000 | Ceph client/monitor | Brocade | Jumbo frames enabled, no gateway needed |
200 | CEPH-CLUSTER | 10.200.0.0/24 | None (internal) | 9000 | Ceph OSD replication | Arista | Jumbo frames enabled, no gateway needed |
Kubernetes Internal Networks
Network | CIDR | Purpose | MTU |
---|---|---|---|
Pod Network | 10.69.0.0/16 | Cilium pod CIDR | 1500 |
Service Network | 10.96.0.0/16 | Kubernetes services | 1500 |
VLAN Tagging Summary
- Tagged (Trunked): All inter-switch links, Proxmox host uplinks (for VM traffic)
- Untagged Access Ports: Client devices on appropriate VLANs
- [TODO: Document which VLANs are allowed on which trunk ports]
Physical Topology
High-Level Site Connectivity
```mermaid
graph TB
  subgraph House
    ARUBA[Aruba S2500-48p<br/>192.168.1.26]
    OPNS[OPNsense Router<br/>Gateway for VLAN 1, 100]
    MTIKHOUSE[Mikrotik NRay60<br/>192.168.1.7]
    CLIENTS[Client Devices]
    OPNS --- ARUBA
    ARUBA --- CLIENTS
    ARUBA --- MTIKHOUSE
  end
  subgraph WirelessBridge["60GHz Wireless Bridge - 1Gbps"]
    MTIKHOUSE <-.1Gbps Wireless.-> MTIKSHOP
  end
  subgraph Garage["Garage/Shop"]
    MTIKSHOP[Mikrotik NRay60<br/>192.168.1.8]
    BROCADE[Brocade ICX6610<br/>192.168.1.20<br/>Core/L3 Switch]
    ARISTA[Arista 7050<br/>192.168.1.21<br/>Distribution Switch]
    MTIKSHOP --- BROCADE
    PX1[Proxmox Host 1<br/>6 links]
    PX2[Proxmox Host 2<br/>6 links]
    PX3[Proxmox Host 3<br/>6 links]
    PX4[Proxmox Host 4<br/>6 links]
    BROCADE <-->|2x 40Gb QSFP+<br/>ONE DISABLED| ARISTA
    PX1 --> BROCADE
    PX2 --> BROCADE
    PX3 --> BROCADE
    PX4 --> BROCADE
    PX1 -.40Gb.-> ARISTA
    PX2 -.40Gb.-> ARISTA
    PX3 -.40Gb.-> ARISTA
    PX4 -.40Gb.-> ARISTA
  end
  style BROCADE fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
  style ARISTA fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
  style ARUBA fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff
```
Proxmox Host Connectivity Detail
Each Proxmox host has 6 network connections:
```mermaid
graph LR
  subgraph ProxmoxHost["Single Proxmox Host - 6 Links"]
    IPMI[IPMI/BMC<br/>1Gb NIC]
    MGMT[Management Bond<br/>2x 1Gb]
    VM[VM Bridge Bond<br/>2x 10Gb]
    CEPH[Ceph Storage<br/>1x 40Gb]
  end
  subgraph Brocade["Brocade ICX6610"]
    B1["Port: 1Gb"]
    B2["LAG: 2x 1Gb"]
    B3["LAG: 2x 10Gb"]
  end
  subgraph Arista["Arista 7050"]
    A1["Port: 40Gb"]
  end
  IPMI -->|Standalone| B1
  MGMT -->|LACP Bond| B2
  VM -->|LACP Bond<br/>VLAN 100, 150| B3
  CEPH -->|Standalone<br/>VLAN 200| A1
  style Brocade fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
  style Arista fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
```
Per-Host Link Summary:
- IPMI: 1x 1Gb to Brocade (dedicated management)
- Proxmox Management: 2x 1Gb LACP bond to Brocade (Proxmox host IP)
- VM Traffic: 2x 10Gb LACP bond to Brocade (bridges for VMs, VLAN 100, 150)
- Ceph Cluster: 1x 40Gb to Arista (VLAN 200 only)
Total Bandwidth per Host:
- To Brocade: 23 Gbps (3 + 20 Gbps)
- To Arista: 40 Gbps
Brocade-Arista Interconnect (ISSUE)
```mermaid
graph LR
  subgraph Brocade["Brocade ICX6610"]
    BP1[QSFP+ Port 1<br/>40Gb]
    BP2[QSFP+ Port 2<br/>40Gb]
  end
  subgraph Arista["Arista 7050"]
    AP1[QSFP+ Port 1<br/>40Gb<br/>ACTIVE]
    AP2[QSFP+ Port 2<br/>40Gb<br/>DISABLED]
  end
  BP1 ---|"Currently: Simple Trunk"| AP1
  BP2 -.-|DISABLED to prevent loop| AP2
  style AP2 fill:#ff0000,stroke:#fff,stroke-width:2px,color:#fff
```
CURRENT ISSUE:
- 2x 40Gb links are configured as separate trunk ports (default VLAN 1, passing all VLANs)
- This creates a layer 2 loop
- ONE port disabled on Arista side as workaround
- SOLUTION NEEDED: Configure proper LACP/port-channel on both switches
[TODO: Document target LAG configuration]
Logical Topology
Layer 2 VLAN Distribution
```mermaid
graph TB
  subgraph VLANs["Layer 2 VLANs"]
    V1["VLAN 1: Management<br/>192.168.1.0/24"]
    V100["VLAN 100: Servers<br/>10.100.0.0/24"]
    V150["VLAN 150: Ceph Public<br/>10.150.0.0/24"]
    V200["VLAN 200: Ceph Cluster<br/>10.200.0.0/24"]
  end
  subgraph BrocadeCore["Brocade Core"]
    B[Brocade ICX6610<br/>L3 Gateway<br/>VLAN 150, 200]
  end
  subgraph AristaDist["Arista Distribution"]
    A[Arista 7050<br/>L2 Only]
  end
  subgraph OPNsenseRouter["OPNsense Router"]
    O[OPNsense<br/>L3 Gateway<br/>VLAN 1, 100]
  end
  V1 --> B
  V100 --> B
  V150 --> B
  V200 --> A
  B -->|Routes to| O
  A --> B
  style B fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
  style O fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff
```
Layer 3 Routing
Primary Gateways:
- OPNsense (at house):
  - VLAN 1: 192.168.1.1
  - VLAN 100: 10.100.0.1
  - Default gateway for internet access
  - 2.5 Gbps AT&T fiber
- Brocade ICX6610 (at garage):
  - VLAN 1: 192.168.1.20
  - VLAN 100: 10.100.0.10
  - VLAN 150: None
  - VLAN 200: None
  - VIPs that route back to the gateway at 192.168.1.1 or 10.100.0.1
[TODO: Document inter-VLAN routing rules]
- Can VLAN 150/200 reach the internet? USER-TODO: need to verify
- Are there firewall rules blocking inter-VLAN traffic?
- How does Ceph traffic route if needed?
Traffic Flows
VLAN 1 (Management) - 192.168.1.0/24
Purpose: Switch management, IPMI, admin access
Flow:
Client (House)
→ Aruba Switch
→ Wireless Bridge (1Gbps)
→ Brocade
→ Switch/IPMI management interface
Devices:
- All switch management IPs
- Proxmox IPMI interfaces
- Admin workstations
- [TODO: Complete device list]
VLAN 100 (Servers) - 10.100.0.0/24
Purpose: Kubernetes nodes, VM primary network
Flow:
Talos Node (10.100.0.50-56)
→ Proxmox VM Bridge (10Gb bond)
→ Brocade (2x 10Gb bond)
→ Routes through OPNsense for internet
Key Services:
- Kubernetes API: 10.100.0.40:6443 (VIP)
- Talos nodes: 10.100.0.50-56
Internet Access: Yes (via OPNsense gateway)
VLAN 150 (Ceph Public) - 10.150.0.0/24
Purpose: Ceph client connections, monitor communication, CSI drivers
MTU: 9000 (Jumbo frames)
Flow:
Kubernetes Pod (needs storage)
→ Rook CSI Driver
→ Talos Node (10.150.0.10-16)
→ Proxmox VM Bridge (10Gb bond)
→ Brocade (2x 10Gb bond)
→ Ceph Monitors on Proxmox hosts
Key Services:
- Ceph Monitors: 10.150.0.4, 10.150.0.2
- Kubernetes nodes: 10.150.0.10-16 (secondary IPs)
- Rook CSI drivers connect via this network
Gateway: None required (internal-only network)
Internet Access: Not needed (Ceph storage network)
Performance:
- 2x 10Gb bonded links per host
- Jumbo frames (MTU 9000)
- Shared with VLAN 100 on same physical bond
VLAN 200 (Ceph Cluster) - 10.200.0.0/24
Purpose: Ceph OSD replication, cluster heartbeat (backend traffic)
MTU: 9000 (Jumbo frames)
Flow:
Ceph OSD on Proxmox Host 1
→ 40Gb link to Arista
→ Arista 7050 (switch fabric)
→ 40Gb link to Proxmox Host 2-4
→ Ceph OSD on other hosts
Key Characteristics:
- Dedicated high-speed path: Uses 40Gb links exclusively
- East-west traffic only: OSD-to-OSD replication
- Does NOT traverse Brocade for data path
- Arista provides switching for this VLAN
Gateway: None required (internal-only network)
Internet Access: Not needed (Ceph backend replication only)
Performance:
- 40Gbps per host to Arista
- Dedicated bandwidth (not shared with other traffic)
- Jumbo frames critical for large object transfers
[TODO: Document Proxmox host IPs on this VLAN] USER-TODO: Need to choose/configure
Traffic Segregation Summary
VLAN | Physical Path | Bandwidth | MTU | Shared? |
---|---|---|---|---|
1 (Management) | 1Gb/10Gb to Brocade | Shared | 1500 | Yes |
100 (Servers) | 2x 10Gb bond to Brocade | 20 Gbps | 1500 | Yes (with VLAN 150) |
150 (Ceph Public) | 2x 10Gb bond to Brocade | 20 Gbps | 9000 | Yes (with VLAN 100) |
200 (Ceph Cluster) | 1x 40Gb to Arista | 40 Gbps | 9000 | No (dedicated) |
Switch Configuration
Brocade ICX6610 Configuration
Role: Core L3 switch, VLAN routing for 150/200
Port Assignments:
[TODO: Document port assignments]
Example:
- Ports 1/1/1-4: IPMI connections (VLAN 1 untagged)
- Ports 1/1/5-12: Proxmox management bonds (LAG groups)
- Ports 1/1/13-20: Proxmox 10Gb bonds (LAG groups, trunk VLAN 100, 150)
- Ports 1/1/41-42: 40Gb to Arista (LAG group, trunk all VLANs)
- Port 1/1/48: Uplink to Mikrotik (trunk all VLANs)
VLAN Interfaces (SVI):
[TODO: Brocade config snippet]
interface ve 1
ip address 192.168.1.20/24
interface ve 150
ip address [TODO]/24
mtu 9000
interface ve 200
ip address [TODO]/24
mtu 9000
Static Routes:
[TODO: Document static routes to OPNsense]
Arista 7050 Configuration
Role: High-speed distribution for VLAN 200 (Ceph cluster)
Port Assignments:
[TODO: Document port assignments]
Example:
- Ports Et1-4: Proxmox 40Gb links (VLAN 200 tagged)
- Ports Et49-50: 40Gb to Brocade (port-channel, trunk all VLANs)
Configuration:
[TODO: Arista config snippet for port-channel]
Aruba S2500-48p Configuration
Role: Access switch at house
Uplink: Via Mikrotik wireless bridge to garage
[TODO: Document VLAN configuration and port assignments]
Common Configuration Tasks
Fix Brocade-Arista LAG Issue
Current State: One 40Gb link disabled to prevent loop
Target State: Both 40Gb links in LACP port-channel
Brocade Configuration:
[TODO: Brocade LACP config]
lag "brocade-to-arista" dynamic id [lag-id]
ports ethernet 1/1/41 to 1/1/42
primary-port 1/1/41
deploy
interface ethernet 1/1/41
link-aggregate active
interface ethernet 1/1/42
link-aggregate active
Arista Configuration:
[TODO: Arista LACP config]
interface Port-Channel1
description Link to Brocade ICX6610
switchport mode trunk
switchport trunk allowed vlan 1,100,150,200
interface Ethernet49
channel-group 1 mode active
description Link to Brocade 40G-1
interface Ethernet50
channel-group 1 mode active
description Link to Brocade 40G-2
Performance Characteristics
Bandwidth Allocation
Total Uplink Capacity (Garage to House):
- 1 Gbps (Mikrotik 60GHz bridge)
- Bottleneck: All VLAN 1 and internet-bound traffic limited to 1Gbps
Garage Internal Bandwidth:
- Brocade to Hosts: 92 Gbps aggregate (12x 1Gb + 8x 10Gb bonds)
- Arista to Hosts: 160 Gbps (4x 40Gb)
- Brocade-Arista: 40 Gbps (when LAG working: 80 Gbps)
Expected Traffic Patterns
High Bandwidth Flows:
- Ceph OSD replication (VLAN 200) - 40Gb per host
- Ceph client I/O (VLAN 150) - 20Gb shared per host
- VM network traffic (VLAN 100) - 20Gb shared per host
Constrained Flows:
- Internet access - limited to 1Gbps wireless bridge
- Management traffic - shared 1Gbps wireless bridge
Troubleshooting Reference
Connectivity Testing
Test Management Access:
# From any client
ping 192.168.1.20 # Brocade
ping 192.168.1.21 # Arista
ping 192.168.1.26 # Aruba
# Test across wireless bridge
ping 192.168.1.7 # Mikrotik House
ping 192.168.1.8 # Mikrotik Shop
Test VLAN 100 (Servers):
ping 10.100.0.40 # Kubernetes VIP
ping 10.100.0.50 # Talos control-1
Test VLAN 150 (Ceph Public):
ping 10.150.0.10 # Talos control-1 storage interface
Check Link Aggregation Status
Brocade:
show lag
show interface ethernet 1/1/41
show interface ethernet 1/1/42
Arista:
show port-channel summary
show interface ethernet 49
show interface ethernet 50
Monitor Traffic
Brocade:
show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate
Check VLAN configuration:
show vlan
show interface brief
Known Issues and Gotchas
Active Issues
- Brocade-Arista Interconnect Loop
  - Symptom: Network storms, high CPU on switches, connectivity issues
  - Current Workaround: One 40Gb link disabled on Arista side
  - Root Cause: Links configured as separate trunks instead of LAG
  - Solution: Configure LACP/port-channel on both switches (see above)
- [TODO: Document other known issues]
Design Considerations
- Wireless Bridge Bottleneck
  - All internet traffic and house-to-garage traffic limited to 1Gbps
  - Management access during a wireless outage is difficult
  - Consider: OOB management network or local crash cart access
- Single Point of Failure
  - Wireless bridge failure isolates garage from house
  - Brocade failure loses routing for VLAN 150/200
  - Consider: Redundancy strategy
- VLAN 200 Routing
  - If gateway is on Brocade but traffic flows through Arista, need to verify routing
  - Confirm: Does VLAN 200 need a gateway at all? (internal only)
Future Improvements
[TODO: Document planned network changes]
- Fix Brocade-Arista LAG to enable second 40Gb link
- Document complete port assignments for all switches
- Add network monitoring/observability (Prometheus exporters?)
- Consider redundant wireless link or fiber between buildings
- Implement proper change management for switch configs
- [TODO: Add your planned improvements]
Change Log
Date | Change | Person | Notes |
---|---|---|---|
2025-10-14 | Initial documentation created | Claude | Baseline network topology documentation |
[TODO] | [TODO] | [TODO] | [TODO] |
References
- Talos Configuration: kubernetes/bootstrap/talos/talconfig.yaml
- Network Patches: kubernetes/bootstrap/talos/patches/global/machine-network.yaml
- Kubernetes Network: See docs/src/architecture/network.md for application-level networking
- Storage Network: See docs/src/architecture/storage.md for Ceph network details
Storage Architecture
Storage Overview
The Dapper Cluster uses Rook Ceph as its primary storage solution, providing unified storage for all Kubernetes workloads. The external Ceph cluster runs on Proxmox hosts and is connected to the Kubernetes cluster via Rook's external cluster mode.
```mermaid
graph TD
  subgraph Ceph["External Ceph Cluster"]
    MON[Ceph Monitors]
    OSD[Ceph OSDs]
    MDS[Ceph MDS - CephFS]
  end
  subgraph K8s["Kubernetes Cluster"]
    ROOK[Rook Operator]
    CSI[Ceph CSI Drivers]
    SC[Storage Classes]
  end
  subgraph Applications
    APPS[Application Pods]
    PVC[Persistent Volume Claims]
  end
  MON --> ROOK
  ROOK --> CSI
  CSI --> SC
  SC --> PVC
  PVC --> APPS
  CSI --> MDS
  CSI --> OSD
```
Storage Architecture Decision
Why Rook Ceph?
The cluster migrated from OpenEBS Mayastor and various NFS backends to Rook Ceph for several key reasons:
- Unified Storage Platform: Single storage solution for all workload types
- External Cluster Design: Leverages existing Proxmox Ceph cluster infrastructure
- High Performance: Direct Ceph integration without NFS overhead
- Scalability: Native Ceph scalability for growing storage needs
- Feature Rich: Snapshots, cloning, expansion, and advanced storage features
- ReadWriteMany Support: CephFS provides shared filesystem access
- Production Proven: Mature, widely-adopted storage solution
Migration History
- Previous: OpenEBS Mayastor (block storage) + Unraid NFS backends (shared storage)
- Current: Rook Ceph with CephFS and RBD (unified storage platform)
- In Progress: Decommissioning Unraid servers (tower/tower-2) in favor of Ceph
Current Storage Classes
CephFS Shared Storage (Default)
Storage Class: cephfs-shared
Primary storage class for all workloads requiring dynamic provisioning.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cephfs-shared
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
clusterID: rook-ceph
fsName: cephfs
pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
Characteristics:
- Access Mode: ReadWriteMany (RWX) - Multiple pods can read/write simultaneously
- Use Cases:
- Applications requiring shared storage
- Media applications
- Backup repositories (VolSync)
- Configuration storage
- General application storage
- Performance: Good performance for most workloads, shared filesystem overhead
- Default: Yes - all PVCs without explicit storageClassName use this
CephFS Static Storage
Storage Class: cephfs-static
Used for pre-existing CephFS paths that need to be mounted into Kubernetes.
Characteristics:
- Access Mode: ReadWriteMany (RWX)
- Use Cases:
  - Mounting existing data directories (e.g., /truenas/* paths)
  - Large media libraries
  - Shared configuration repositories
  - Data migration scenarios
- Provisioning: Manual - requires creating both PV and PVC
- Pattern: See "Static PV Pattern" section below
Example: Media storage at /truenas/media
apiVersion: v1
kind: PersistentVolume
metadata:
name: media-cephfs-pv
spec:
capacity:
storage: 100Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/media
RBD Block Storage
Storage Classes: ceph-rbd, ceph-bulk
High-performance block storage using Ceph RADOS Block Devices.
Characteristics:
- Access Mode: ReadWriteOnce (RWO) - Single pod exclusive access
- Performance: Superior to CephFS for block workloads (databases, etc.)
- Thin Provisioning: Efficient storage allocation
- Features: Snapshots, clones, fast resizing
Use Cases:
- PostgreSQL and other databases
- Stateful applications requiring block storage
- Applications needing high IOPS
- Workloads migrating from OpenEBS Mayastor
Storage Classes:
- ceph-rbd: General-purpose RBD storage
- ceph-bulk: Erasure-coded pool for large, less-critical data
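A database PVC on RBD looks like any other claim, just with the storage class named explicitly. A minimal sketch; the claim name and namespace are placeholders.

```yaml
# Illustrative RWO claim on the ceph-rbd storage class (e.g., for PostgreSQL data)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data      # placeholder name
  namespace: database      # placeholder namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 50Gi
```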
Legacy Unraid NFS Storage (Being Decommissioned)
Storage Class: used-nfs (no storage class for static tower/tower-2 PVs)
Legacy NFS storage from Unraid servers, currently being migrated to Ceph.
Servers:
- tower.manor - Primary Unraid server (100Ti NFS) - Decommissioning
- tower-2.manor - Secondary Unraid server (100Ti NFS) - Decommissioning
Current Status:
- Some media applications still use hybrid approach during migration
- Active data migration to CephFS in progress
- Will be fully retired once migration complete
Migration Plan: All workloads being moved to Ceph (CephFS or RBD as appropriate)
Storage Provisioning Patterns
Dynamic Provisioning (Default)
For most applications, simply create a PVC and Kubernetes will automatically provision storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-app-data
namespace: my-namespace
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
# No storageClassName specified = uses default (cephfs-shared)
Static PV Pattern
For mounting pre-existing CephFS paths:
Step 1: Create PersistentVolume
apiVersion: v1
kind: PersistentVolume
metadata:
name: my-static-pv
spec:
capacity:
storage: 5Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/my-data # Pre-existing path in CephFS
Step 2: Create matching PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-static-pvc
namespace: my-namespace
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Ti
storageClassName: cephfs-static
volumeName: my-static-pv
Current Static PVs in Use:
- media-cephfs-pv → /truenas/media (100Ti)
- minio-cephfs-pv → /truenas/minio (10Ti)
- paperless-cephfs-pv → /truenas/paperless (5Ti)
Storage Decision Matrix
Workload Type | Storage Class | Access Mode | Rationale |
---|---|---|---|
Databases (PostgreSQL, etc.) | ceph-rbd | RWO | Best performance for block storage workloads |
Media Libraries | cephfs-static or cephfs-shared | RWX | Shared access for media servers |
Media Downloads | cephfs-shared | RWX | Multi-pod write access |
Application Data (single pod) | ceph-rbd | RWO | High performance block storage |
Application Data (multi-pod) | cephfs-shared | RWX | Concurrent access required |
Backup Repositories | cephfs-shared | RWX | VolSync requires RWX |
Shared Config | cephfs-shared | RWX | Multiple pods need access |
Bulk Storage | ceph-bulk or cephfs-static | RWO/RWX | Large datasets, erasure coding |
Legacy Apps (during migration) | used-nfs | RWX | Temporary until Unraid decom complete |
Backup Strategy
VolSync with CephFS
All persistent data is backed up using VolSync, which now uses CephFS for its repository storage:
- Backup Frequency: Hourly snapshots via ReplicationSource
- Repository Storage: CephFS PVC (migrated from NFS)
- Backend: Restic repositories on CephFS
- Retention: Configurable per-application
- Recovery: Supports restore to same or different PVC
VolSync Repository Location: /repository/{APP} on CephFS
Network Configuration
Ceph Networks
The external Ceph cluster uses two networks:
- Public Network: 10.150.0.0/24
- Client connections from Kubernetes
- Ceph monitor communication
- Used by CSI drivers
- Cluster Network: 10.200.0.0/24
- OSD-to-OSD replication
- Not directly accessed by Kubernetes
Connection Method
Kubernetes connects to Ceph via:
- Rook Operator: Manages connection to external cluster
- CSI Drivers: cephfs.csi.ceph.com for CephFS volumes
- Mon Endpoints: ConfigMap with Ceph monitor addresses
- Authentication: Ceph client.kubernetes credentials
Performance Characteristics
CephFS Performance
- Sequential Read: Excellent (limited by network, ~10 Gbps)
- Sequential Write: Very Good (COW overhead, CRUSH rebalancing)
- Random I/O: Good (shared filesystem overhead)
- Concurrent Access: Excellent (native RWX support)
- Metadata Operations: Good (dedicated MDS servers)
Optimization Tips
- Use RWO when possible: Even on CephFS, specify RWO if no sharing needed
- Size appropriately: CephFS handles small and large files well
- Monitor MDS health: CephFS performance depends on MDS responsiveness
- Enable client caching: Default CSI settings enable attribute caching
Storage Operations
Common Operations
Expand a PVC:
kubectl patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
Check storage usage:
kubectl get pvc -A
kubectl exec -it <pod> -- df -h
Monitor Ceph cluster health:
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
List CephFS mounts:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
Troubleshooting
PVC stuck in Pending:
kubectl describe pvc <pvc-name>
kubectl -n rook-ceph logs -l app=rook-ceph-operator
Slow performance:
# Check Ceph cluster health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
# Check MDS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
Mount issues:
# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin
# Verify connection to monitors
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml
Current Migration Status
Completed
- ✅ RBD storage classes implemented and available
- ✅ CephFS as default storage class
- ✅ VolSync migrated to CephFS backend
- ✅ Static PV pattern established for existing data
- ✅ Migrated from OpenEBS Mayastor to Ceph RBD
In Progress
- 🔄 Decommissioning Unraid NFS servers (tower/tower-2)
- 🔄 Migrating remaining media workloads from NFS to CephFS
- 🔄 Consolidating all storage onto Ceph platform
Future Enhancements
- 📋 Additional RBD pool with SSD backing for critical workloads
- 📋 Erasure coding optimization for bulk media storage
- 📋 Advanced snapshot scheduling and retention policies
- 📋 Ceph performance tuning and optimization
Best Practices
Storage Selection
- Databases and single-pod apps: Use ceph-rbd for best performance
- Shared storage needs: Use cephfs-shared for RWX access
- Use static PVs for existing data: Don't duplicate large datasets
- Specify requests accurately: Helps with capacity planning
- Choose appropriate access modes: RWO for RBD, RWX for CephFS
Capacity Planning
- Monitor Ceph cluster capacity: kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
- Set appropriate PVC sizes: CephFS supports expansion
- Plan for growth: Ceph cluster can scale by adding OSDs
- Regular capacity reviews: Check usage trends
Data Protection
- Enable VolSync: For all stateful applications
- Test restores regularly: Ensure backup viability
- Monitor backup success: Check ReplicationSource status
- Retain snapshots appropriately: Balance storage cost vs recovery needs
Security
- Use namespace isolation: PVCs are namespace-scoped
- Limit access with RBAC: Control who can create PVCs
- Monitor access patterns: Unusual I/O may indicate issues
- Rotate Ceph credentials: Periodically update client keys
Monitoring and Observability
Key Metrics
Monitor these metrics via Prometheus/Grafana:
- Ceph cluster health status
- OSD utilization and performance
- MDS cache hit rates
- PVC capacity usage
- CSI operation latencies
- VolSync backup success rates
Alerts
Critical alerts configured:
- Ceph cluster health warnings
- High OSD utilization (>80%)
- MDS performance degradation
- PVC approaching capacity
- VolSync backup failures
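A hedged sketch of how one of these alerts might be expressed as a PrometheusRule; the metric names assume the standard Ceph Prometheus exporter and the threshold and namespace are illustrative, not the cluster's actual rule files.

```yaml
# Illustrative alert for high OSD utilization; real rules live with the monitoring stack
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-storage-alerts
  namespace: observability   # placeholder namespace
spec:
  groups:
    - name: ceph.capacity
      rules:
        - alert: CephOSDNearFull
          expr: (ceph_osd_stat_bytes_used / ceph_osd_stat_bytes) > 0.80
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Ceph OSD {{ $labels.ceph_daemon }} is over 80% full"
```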
References
- Rook Documentation: rook.io/docs
- Ceph Documentation: docs.ceph.com
- Local Setup: kubernetes/apps/rook-ceph/README.md
- Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml
Media Applications
Media Stack Overview
The media stack provides automated media management and streaming services using the *arr suite of applications and Plex Media Server.
```mermaid
graph TD
  subgraph Acquisition["Content Acquisition"]
    SONARR[Sonarr - TV Shows]
    SONARR_UHD[Sonarr UHD - 4K TV]
    RADARR[Radarr - Movies]
    RADARR_UHD[Radarr UHD - 4K Movies]
    BAZARR[Bazarr - Subtitles]
    BAZARR_UHD[Bazarr UHD - 4K Subtitles]
  end
  subgraph Downloads["Download Clients"]
    SABNZBD[SABnzbd - Usenet]
    NZBGET[NZBget - Usenet Alt]
  end
  subgraph Storage["Media Storage"]
    CEPHFS[CephFS Static PV<br>/truenas/media]
    TOWER[Tower NFS<br>/mnt/user]
    TOWER2[Tower-2 NFS<br>/mnt/user]
  end
  subgraph Server["Media Server"]
    PLEX[Plex Media Server]
    TAUTULLI[Tautulli - Analytics]
    OVERSEERR[Overseerr - Requests]
  end
  subgraph PostProcessing["Post Processing"]
    TDARR[Tdarr - Transcoding]
    KOMETA[Kometa - Metadata]
  end
  SONARR --> SABNZBD
  RADARR --> SABNZBD
  SABNZBD --> CEPHFS
  SABNZBD --> TOWER
  SABNZBD --> TOWER2
  CEPHFS --> PLEX
  TOWER --> PLEX
  TOWER2 --> PLEX
  PLEX --> TAUTULLI
  OVERSEERR --> SONARR
  OVERSEERR --> RADARR
  TDARR --> CEPHFS
  KOMETA --> PLEX
```
Storage Configuration
Primary Media Library - CephFS
The main media library is stored on CephFS using a static PV that mounts the pre-existing /truenas/media directory.
Configuration: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: media-cephfs-pv
spec:
capacity:
storage: 100Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/media
Mount Pattern: Applications mount this PVC at /media or specific subdirectories:
- /media/downloads - Download staging area
- /media/tv - TV show library
- /media/movies - Movie library
- /media/music - Music library
- /media/books - Book library
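In a workload spec this is simply the shared claim mounted at /media. A minimal sketch using the media-cephfs-pvc claim; the pod name and container image are illustrative.

```yaml
# Illustrative volume wiring for the shared media library
apiVersion: v1
kind: Pod
metadata:
  name: media-app-example   # placeholder
  namespace: media
spec:
  containers:
    - name: app
      image: ghcr.io/example/media-app:latest   # placeholder image
      volumeMounts:
        - name: media
          mountPath: /media
  volumes:
    - name: media
      persistentVolumeClaim:
        claimName: media-cephfs-pvc
```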
Benefits:
- ReadWriteMany: Multiple pods can access simultaneously
- High Performance: Direct CephFS access, no NFS overhead
- Shared Access: All media apps see the same filesystem
- Snapshots: VolSync backups protect the data
Legacy NFS Mounts (Unraid)
Download clients and some media applications use legacy NFS mounts from Unraid servers alongside CephFS.
Servers:
- tower.manor - Primary Unraid server (100Ti NFS)
- tower-2.manor - Secondary Unraid server (100Ti NFS)
Current Usage:
- SABnzbd downloads to all three storage backends (CephFS, tower, tower-2)
- Plex reads media from all three storage backends
- Active downloads and in-progress media on Unraid
- Organized/completed media on CephFS
Status: Legacy - gradual migration to CephFS in progress
Configuration: Static PVs without storage class
apiVersion: v1
kind: PersistentVolume
metadata:
name: media-tower-pv
spec:
capacity:
storage: 100Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
nfs:
server: tower.manor
path: /mnt/user
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: media-tower-2-pv
spec:
capacity:
storage: 100Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
nfs:
server: tower-2.manor
path: /mnt/user
Core Components
Media Server
Plex Media Server
Namespace: media
Purpose: Media streaming and library management
Plex is the primary media server, providing:
- Streaming to multiple devices
- Hardware transcoding (Intel Quick Sync)
- Library organization and metadata
- User management and sharing
- Remote access
Configuration: kubernetes/apps/media/plex/app/helmrelease.yaml
Storage Mounts (Multi-backend):
- Media library (CephFS): CephFS static PV at /media
- Media library (Tower): Tower NFS at /tower
- Media library (Tower-2): Tower-2 NFS at /tower-2
- Configuration: CephFS dynamic PVC (10Gi)
- Transcoding cache: EmptyDir (temporary)
Library Configuration:
- Plex libraries configured to scan all three storage backends
- Unified library view across CephFS and Unraid storage
Resource Allocation:
resources:
requests:
cpu: 2000m
memory: 4Gi
gpu.intel.com/i915: 1
limits:
cpu: 8000m
memory: 16Gi
gpu.intel.com/i915: 1
Hardware Acceleration: Intel Quick Sync enabled for transcoding
Tautulli
Namespace: media
Purpose: Plex analytics and monitoring
Provides:
- Watch history and statistics
- User activity monitoring
- Notification triggers
- API for automation
Storage: CephFS dynamic PVC (5Gi) for database and logs
Content Acquisition (*arr Suite)
Sonarr / Sonarr UHD
Purpose: TV show automation
- Sonarr: Standard quality TV shows
- Sonarr UHD: 4K/UHD TV shows
Features:
- TV series tracking and monitoring
- Episode search and download
- Quality profiles and upgrades
- Calendar and schedule tracking
Storage:
- Configuration: CephFS dynamic PVC (10Gi)
- Media access: CephFS static PV (shared /media)
Radarr / Radarr UHD
Purpose: Movie automation
- Radarr: Standard quality movies
- Radarr UHD: 4K/UHD movies
Features:
- Movie library management
- Automated downloads
- Quality management
- List integration (IMDb, Trakt)
Storage:
- Configuration: CephFS dynamic PVC (10Gi)
- Media access: CephFS static PV (shared /media)
Bazarr / Bazarr UHD
Purpose: Subtitle management
Automated subtitle downloading for:
- TV shows (via Sonarr integration)
- Movies (via Radarr integration)
- Multiple languages
- Subtitle providers
Storage: CephFS dynamic PVC (5Gi)
Download Clients
SABnzbd
Namespace: media
Purpose: Primary Usenet download client
Features:
- NZB file processing
- Automated post-processing
- Category-based handling
- Integration with *arr apps
Storage Mounts (Multi-backend):
- Configuration: CephFS dynamic PVC (5Gi)
- Downloads (CephFS): CephFS static PV /media/downloads/usenet
- Downloads (Tower): Tower NFS /tower/downloads/usenet
- Downloads (Tower-2): Tower-2 NFS /tower-2/downloads/usenet
- Incomplete: CephFS dynamic PVC (temporary downloads)
Download Strategy:
- Categories route to different storage backends
- Active downloads use appropriate backend based on category
- Completed downloads moved to final library location
Post-Processing: Automatically moves completed downloads to appropriate media folders
NZBget
Namespace: media
Purpose: Alternative Usenet client
Lightweight alternative to SABnzbd for specific use cases.
Storage: Similar pattern to SABnzbd
Post-Processing
Tdarr
Purpose: Media transcoding and file optimization
Components:
- Tdarr Server: Manages transcoding queue
- Tdarr Node: CPU-based transcoding workers
- Tdarr Node GPU: GPU-accelerated transcoding
Use Cases:
- Convert media to h265/HEVC
- Reduce file sizes
- Standardize formats
- Remove unwanted audio/subtitle tracks
Storage:
- Configuration: CephFS dynamic PVC (25Gi)
- Media access: CephFS static PV (shared /media)
- Transcode cache: CephFS dynamic PVC (100Gi)
Resource Intensive: Uses significant CPU/GPU resources during transcoding
Kometa (formerly Plex Meta Manager)
Purpose: Enhanced Plex metadata and collections
Features:
- Automated collections (e.g., "Top Rated 2023")
- Poster and artwork management
- Rating and tag synchronization
- Scheduled metadata updates
Storage: CephFS dynamic PVC (5Gi) for configuration
User Management
Overseerr
Namespace: media
Purpose: Media request management
User-facing application for:
- Media requests (movies/TV shows)
- Request approval workflow
- User quotas and limits
- Integration with Sonarr/Radarr
Authentication: Integrated with Plex accounts
Storage: CephFS dynamic PVC (5Gi)
Network Configuration
Internal Access
All media applications are accessible via internal DNS:
spec:
ingressClassName: internal
hosts:
- host: plex.chelonianlabs.com
paths:
- path: /
pathType: Prefix
External Access
Plex is accessible externally via:
- Cloudflared tunnel for secure access
- Direct access on port 32400 (firewall controlled)
Service Discovery
Applications discover each other via Kubernetes services:
sonarr.media.svc.cluster.local:8989
radarr.media.svc.cluster.local:7878
sabnzbd.media.svc.cluster.local:8080
plex.media.svc.cluster.local:32400
Backup Strategy
Application Configuration
All *arr application configurations are backed up via VolSync:
Backup Schedule: Hourly
Retention:
- Hourly: 24 snapshots
- Daily: 7 snapshots
- Weekly: 4 snapshots
Backup Pattern:
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
name: sonarr
namespace: media
spec:
sourcePVC: sonarr-config
trigger:
schedule: "0 * * * *"
restic:
repository: sonarr-restic-secret
retain:
hourly: 24
daily: 7
weekly: 4
Media Library
Media files are NOT backed up via VolSync due to size (100Ti+)
Protection Strategy:
- Ceph replication (3x copies across OSDs)
- Replaceable content (can be re-downloaded)
- Critical media manually backed up externally
Configuration Backup: All *arr databases and settings are backed up
Resource Management
Resource Allocation Strategy
Media applications have varying resource needs:
High Resource:
- Plex: 2-8 CPU, 4-16Gi RAM, GPU for transcoding
- Tdarr: 4-16 CPU, 8-32Gi RAM, GPU optional
Medium Resource:
- Sonarr/Radarr: 500m-2 CPU, 512Mi-2Gi RAM
- SABnzbd: 1-4 CPU, 1-4Gi RAM
Low Resource:
- Bazarr: 100m-500m CPU, 128Mi-512Mi RAM
- Overseerr: 100m-500m CPU, 256Mi-1Gi RAM
Storage Quotas
Dynamic PVCs sized appropriately:
- Configuration: 5-10Gi (databases, logs)
- Download buffers: 100Gi (temporary downloads)
- Transcode cache: 100Gi (Tdarr working space)
Maintenance
Regular Tasks
Weekly:
- Review failed downloads
- Check disk space usage
- Verify backup completion
- Update metadata (Kometa)
Monthly:
- Library maintenance (Plex)
- Database optimization (*arr apps)
- Review and cleanup old downloads
- Check for application updates (Renovate handles this)
Health Monitoring
Key Metrics:
- Plex stream count and transcoding sessions
- SABnzbd download queue and speed
- *arr indexer health and search failures
- Storage capacity and growth rate
Alerts:
- Download failures
- Indexer connectivity issues
- Storage capacity warnings
- Failed backup jobs
Troubleshooting
Common Issues
Plex can't see media files:
# Check PVC mount
kubectl exec -n media deployment/plex -- ls -la /media
# Verify permissions
kubectl exec -n media deployment/plex -- ls -ld /media/movies /media/tv
# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
Downloads not moving to library:
# Check SABnzbd logs
kubectl logs -n media deployment/sabnzbd --tail=100
# Verify shared storage access
kubectl exec -n media deployment/sabnzbd -- ls -la /media/downloads/usenet
# Check Sonarr/Radarr import
kubectl logs -n media deployment/sonarr --tail=100 | grep -i import
Slow transcoding:
# Verify GPU allocation
kubectl describe pod -n media -l app.kubernetes.io/name=plex | grep -A5 "Limits\|Requests"
# Check GPU utilization (on node)
intel_gpu_top
# Review transcode logs
kubectl logs -n media deployment/plex | grep -i transcode
Storage full:
# Check PVC usage
kubectl get pvc -n media
# Check storage usage in pod
kubectl exec -n media deployment/plex -- df -h | grep media
# Check Ceph cluster capacity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
Best Practices
Storage Organization
Directory Structure:
/media/
├── downloads/
│ ├── usenet/ # SABnzbd downloads
│ └── complete/ # Completed downloads
├── movies/ # Radarr managed
│ ├── 4k/ # UHD content
│ └── 1080p/ # HD content
├── tv/ # Sonarr managed
│ ├── 4k/
│ └── 1080p/
├── music/
└── books/
Quality Profiles
- Use separate instances for 4K content (Sonarr UHD, Radarr UHD)
- Configure appropriate quality cutoffs
- Enable upgrades for better releases
- Set size limits to prevent excessive downloads
Download Management
- Configure category-based post-processing in SABnzbd
- Use download client categories in *arr apps
- Enable completed download handling
- Set appropriate retention for download history
Performance Optimization
- Use hardware transcoding (Intel Quick Sync)
- Pre-optimize media with Tdarr (h265/HEVC)
- Adjust Plex transcoder quality settings
- Enable Plex optimize versions for common devices
Security Considerations
Access Control
- Internal Network Only: Media apps exposed only via internal ingress
- Authentication Required: All apps require login
- Plex Managed Auth: User access controlled via Plex sharing
- Overseerr Integration: Request permissions via Plex accounts
API Keys
- All API keys stored in Kubernetes secrets
- External Secrets integration with Infisical
- Regular key rotation via automation
- Least privilege access between services
Future Improvements
Planned Enhancements
- GPU Transcoding Pool: Dedicated GPU nodes for Tdarr
- Request Automation: Auto-approve for trusted users
- Advanced Monitoring: Grafana dashboards for media metrics
- Content Analysis: Automated duplicate detection
- Unraid Migration: Gradual migration of tower/tower-2 NFS storage to CephFS
- Currently using hybrid approach (CephFS + tower + tower-2)
- Plan: Consolidate all media storage to CephFS
- Timeline: When Unraid servers are decommissioned
Under Consideration
- Jellyfin: Alternative media server for comparison
- Prowlarr: Unified indexer management
- Readarr: Book management automation
- Lidarr: Music management automation
References
- Media Storage: kubernetes/apps/media/storage/
- Plex: kubernetes/apps/media/plex/
- Sonarr: kubernetes/apps/media/sonarr/
- Radarr: kubernetes/apps/media/radarr/
- SABnzbd: kubernetes/apps/media/sabnzbd/
- Storage Architecture: docs/src/architecture/storage.md
Storage Applications
This document covers the storage-related applications and services running in the cluster.
Storage Stack Overview
```mermaid
graph TD
  subgraph External["External Infrastructure"]
    CEPH[Proxmox Ceph Cluster]
  end
  subgraph K8sStorage["Kubernetes Storage Layer"]
    ROOK[Rook Ceph Operator]
    CSI[Ceph CSI Drivers]
    VOLSYNC[VolSync]
  end
  subgraph Consumers["Storage Consumers"]
    APPS[Applications]
    BACKUPS[Backup Repositories]
  end
  CEPH --> ROOK
  ROOK --> CSI
  CSI --> APPS
  VOLSYNC --> BACKUPS
  CSI --> VOLSYNC
```
Core Components
Rook Ceph Operator
Namespace: rook-ceph
Type: Helm Release
Purpose: Manages connection to external Ceph cluster and provides CSI drivers
The Rook operator is the bridge between Kubernetes and the external Ceph cluster. It:
- Manages CSI driver deployments
- Maintains connection to Ceph monitors
- Handles authentication and secrets
- Provides CephFS filesystem access
Configuration: kubernetes/apps/rook-ceph/rook-ceph-operator/app/helmrelease.yaml
Current Setup:
- CephFS Driver: Enabled ✅
- RBD Driver: Disabled (Phase 2)
- Connection Mode: External cluster
- Network: Public network 10.150.0.0/24
Key Resources:
# Check operator status
kubectl -n rook-ceph get pods -l app=rook-ceph-operator
# View operator logs
kubectl -n rook-ceph logs -l app=rook-ceph-operator -f
# Check CephCluster resource
kubectl -n rook-ceph get cephcluster
Rook Ceph Cluster Configuration
Namespace: rook-ceph
Type: CephCluster Custom Resource
Purpose: Defines external Ceph cluster connection
Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/cluster-external.yaml
This resource tells Rook how to connect to the external Ceph cluster:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
name: rook-ceph
namespace: rook-ceph
spec:
external:
enable: true
dataDirHostPath: /var/lib/rook
cephVersion:
image: quay.io/ceph/ceph:v18
Monitor Configuration: Defined in ConfigMap rook-ceph-mon-endpoints
- Contains Ceph monitor IP addresses
- Critical for cluster connectivity
- Automatically referenced by CSI drivers
Authentication: Stored in Secret rook-ceph-mon
- Contains client.kubernetes Ceph credentials
- Encrypted with SOPS
- Referenced by all CSI operations
Ceph CSI Drivers
Namespace: rook-ceph
Type: DaemonSet (nodes) + Deployment (provisioner)
Purpose: Enable Kubernetes to mount CephFS volumes
Components:
- csi-cephfsplugin (DaemonSet)
  - Runs on every node
  - Mounts CephFS volumes to pods
  - Handles node-level operations
- csi-cephfsplugin-provisioner (Deployment)
  - Creates/deletes CephFS subvolumes
  - Handles dynamic provisioning
  - Manages volume expansion
Monitoring:
# Check CSI pods
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin
# View CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin
# Check provisioner
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin-provisioner
Storage Classes
Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml
cephfs-shared (Default)
Primary storage class for all dynamic provisioning:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cephfs-shared
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
clusterID: rook-ceph
fsName: cephfs
pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete
Usage: Default for all PVCs without explicit storageClassName
cephfs-static
For mounting pre-existing CephFS directories:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: cephfs-static
provisioner: rook-ceph.cephfs.csi.ceph.com
# Used with manually created PVs pointing to existing paths
Usage: Requires manual PV creation, see examples below
VolSync
Namespace: storage
Type: Helm Release
Purpose: Backup and recovery for Persistent Volume Claims
VolSync provides automated backup of all stateful applications using Restic.
Configuration: kubernetes/apps/storage/volsync/app/helmrelease.yaml
Backup Repository: CephFS-backed PVC
- Location: volsync-cephfs-pvc (5Ti)
- Path: /repository/{APP}/ for each application
- Previous: NFS on vault.manor (migrated to CephFS)
How It Works:
- Applications create ReplicationSource resources
- VolSync creates backup pods with mover containers
- Mover mounts both application PVC and repository PVC
- Restic backs up data to repository
- Retention policies keep configured snapshot count
Backup Pattern:
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
name: my-app
namespace: my-namespace
spec:
sourcePVC: my-app-data
trigger:
schedule: "0 * * * *" # Hourly
restic:
repository: my-app-restic-secret
retain:
hourly: 24
daily: 7
weekly: 4
Common Operations:
# Manual backup trigger
task volsync:snapshot NS=<namespace> APP=<app>
# List snapshots
task volsync:run NS=<namespace> REPO=<app> -- snapshots
# Unlock repository (if locked)
task volsync:unlock-local NS=<namespace> APP=<app>
# Restore to new PVC
task volsync:restore NS=<namespace> APP=<app>
Repository PVC Configuration: kubernetes/apps/storage/volsync/app/volsync-cephfs-pv.yaml
Static PV Examples
Media Storage
Large media library mounted from pre-existing CephFS path:
Location: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: media-cephfs-pv
spec:
capacity:
storage: 100Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/media
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: media-cephfs-pvc
namespace: media
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 100Ti
storageClassName: cephfs-static
volumeName: media-cephfs-pv
Usage: Mounted by Plex, Sonarr, Radarr, etc. for media library access
Minio Object Storage
Minio data stored on CephFS:
Location: kubernetes/apps/storage/minio/app/minio-cephfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: minio-cephfs-pv
spec:
capacity:
storage: 10Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/minio
Paperless-ngx Document Storage
Document management system storage:
Location: kubernetes/apps/selfhosted/paperless-ngx/app/paperless-cephfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: paperless-cephfs-pv
spec:
capacity:
storage: 5Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/paperless
Storage Operations
Creating a New Static PV
Step 1: Create directory in CephFS (on Proxmox Ceph node)
# SSH to a Proxmox node with Ceph access
mkdir -p /mnt/cephfs/truenas/my-app
chmod 777 /mnt/cephfs/truenas/my-app # Or appropriate permissions
Step 2: Create PV manifest
apiVersion: v1
kind: PersistentVolume
metadata:
name: my-app-cephfs-pv
spec:
capacity:
storage: 1Ti
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: cephfs-static
csi:
driver: rook-ceph.cephfs.csi.ceph.com
nodeStageSecretRef:
name: rook-csi-cephfs-static
namespace: rook-ceph
volumeAttributes:
clusterID: rook-ceph
fsName: cephfs
staticVolume: "true"
rootPath: /truenas/my-app
Step 3: Create PVC manifest
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-app-cephfs-pvc
namespace: my-namespace
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Ti
storageClassName: cephfs-static
volumeName: my-app-cephfs-pv
Step 4: Apply and verify
kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
kubectl get pv my-app-cephfs-pv
kubectl get pvc -n my-namespace my-app-cephfs-pvc
Expanding a PVC
CephFS supports online volume expansion:
# Edit PVC to increase size
kubectl patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
# Verify expansion
kubectl get pvc -n my-namespace my-pvc -w
Note: Size can only increase, not decrease
Troubleshooting Mount Issues
PVC stuck in Pending:
# Check PVC events
kubectl describe pvc -n <namespace> <pvc-name>
# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin --tail=100
# Verify storage class exists
kubectl get sc cephfs-shared
Pod can't mount volume:
# Check pod events
kubectl describe pod -n <namespace> <pod-name>
# Verify Ceph cluster connectivity
kubectl -n rook-ceph get cephcluster
# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Verify CephFS is available
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
Slow I/O performance:
# Check MDS performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
# Identify slow operations
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
Monitoring and Alerts
Key Metrics
Monitor these via Prometheus/Grafana:
- Storage Capacity
  - Ceph cluster utilization
  - Individual PVC usage
  - Growth trends
- Performance
  - CSI operation latency
  - MDS cache hit ratio
  - OSD I/O rates
- Reliability
  - VolSync backup success rate
  - Ceph health status
  - CSI driver availability
Useful Queries
Check all PVCs by size:
kubectl get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,SIZE:.spec.resources.requests.storage,STORAGECLASS:.spec.storageClassName --sort-by=.spec.resources.requests.storage
Find PVCs using old storage classes:
kubectl get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName == "nfs-csi" or .spec.storageClassName == "mayastor-etcd-localpv") | "\(.metadata.namespace)/\(.metadata.name) - \(.spec.storageClassName)"'
Check Ceph cluster capacity:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
Monitor VolSync backups:
# Check all ReplicationSources
kubectl get replicationsource -A
# Check specific backup status
kubectl get replicationsource -n <namespace> <app> -o jsonpath='{.status.lastSyncTime}'
Backup and Recovery
VolSync Backup Workflow
- Application creates ReplicationSource
- VolSync creates backup pod (every hour by default)
- Restic backs up PVC to repository
- Snapshots retained per retention policy
- Status updated in ReplicationSource
Restore Procedures
Restore to original PVC:
# Scale down application
kubectl scale deployment -n <namespace> <app> --replicas=0
# Run restore
task volsync:restore NS=<namespace> APP=<app>
# Scale up application
kubectl scale deployment -n <namespace> <app> --replicas=1
Restore to new PVC:
- Create ReplicationDestination pointing to new PVC
- VolSync will restore data from repository
- Update application to use new PVC
- Verify data integrity
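A minimal ReplicationDestination sketch for the restore-to-new-PVC flow above; the resource and PVC names are illustrative, and it assumes the same Restic secret that the application's ReplicationSource uses:
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore
  namespace: my-namespace
spec:
  trigger:
    manual: restore-once              # one-shot restore
  restic:
    repository: my-app-restic-secret  # same secret as the ReplicationSource
    destinationPVC: my-app-data-restored  # pre-created target PVC
    copyMethod: Direct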
Disaster Recovery
Complete cluster rebuild:
- Deploy new Kubernetes cluster
- Install Rook with same external Ceph connection
- Recreate storage classes
- Deploy VolSync
- Restore all applications from backups
CephFS corruption:
- Check Ceph health and repair if possible
- If unrecoverable, restore from VolSync backups
- VolSync repository is on CephFS, so ensure repository is intact
- Consider external backup of VolSync repository
Security Considerations
Ceph Authentication
- Client Key: client.kubernetes Ceph user
- Permissions: Limited to CephFS pools only
- Storage: SOPS-encrypted in rook-ceph-mon secret
- Rotation: Should be rotated periodically
PVC Access Control
- Namespace Isolation: PVCs are namespace-scoped
- RBAC: Control who can create/delete PVCs
- Pod Security: Pods must have appropriate security context
- Network Policies: Limit which pods can access storage
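To illustrate the RBAC point above, a namespace-scoped Role and RoleBinding along these lines (names are hypothetical) restrict who can create or delete PVCs:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvc-admin            # hypothetical name
  namespace: media
rules:
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "delete", "patch", "update", "get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvc-admin
  namespace: media
subjects:
  - kind: Group
    name: storage-admins     # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pvc-admin
  apiGroup: rbac.authorization.k8s.io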
Backup Security
- VolSync Repository: Protected by Kubernetes RBAC
- Restic Encryption: Repository encryption with per-app keys
- Snapshot Access: Controlled via ReplicationSource ownership
Future Enhancements (Phase 2)
RBD Block Storage
When Mayastor hardware is repurposed:
- Enable RBD driver in Rook operator
- Create RBD pools on Ceph cluster:
  - ssd-db - Critical workloads
  - rook-pvc-pool - General purpose
  - media-bulk - Erasure-coded bulk storage
- Deploy RBD storage classes
- Migrate workloads based on performance requirements
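For reference, an RBD storage class for the planned ssd-db pool could look roughly like the sketch below. The class name is illustrative and the CSI secret names are the Rook defaults; confirm both against the operator's generated examples before deploying:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-ssd-db            # illustrative name
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: ssd-db
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  # Rook default secret references (verify against the operator's examples)
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
allowVolumeExpansion: true
reclaimPolicy: Delete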
Planned Improvements
- Ceph dashboard integration
- Advanced monitoring dashboards
- Automated capacity alerts
- Storage QoS policies
- Cross-cluster replication
References
- Rook Operator: kubernetes/apps/rook-ceph/rook-ceph-operator/
- Cluster Config: kubernetes/apps/rook-ceph/rook-ceph-cluster/
- Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml
- VolSync: kubernetes/apps/storage/volsync/
- Architecture: docs/src/architecture/storage.md
Observability
Installation Guide
Prerequisites
graph TD subgraph Hardware CP[Control Plane Nodes] GPU[GPU Worker Node] Worker[Worker Nodes] end subgraph Software OS[Operating System] Tools[Required Tools] Network[Network Setup] end subgraph Configuration Git[Git Repository] Secrets[SOPS Setup] Certs[Certificates] end
Hardware Requirements
Control Plane Nodes (3x)
- CPU: 4 cores per node
- RAM: 16GB per node
- Role: Cluster control plane
GPU Worker Node (1x)
- CPU: 16 cores
- RAM: 128GB
- GPU: 4x NVIDIA Tesla P100
- Role: GPU-accelerated workloads
Worker Nodes (2x)
- CPU: 16 cores per node
- RAM: 128GB per node
- Role: General workloads
Software Prerequisites
- Operating System
  - Linux distribution
  - Updated system packages
  - Required kernel modules
  - NVIDIA drivers (for GPU node)
- Required Tools
  - kubectl
  - flux
  - SOPS
  - age/gpg
  - task
Initial Setup
1. Repository Setup
# Clone the repository
git clone https://github.com/username/dapper-cluster.git
cd dapper-cluster
# Create configuration
cp config.sample.yaml config.yaml
2. Configuration
graph LR Config[Configuration] --> Secrets[Secrets Management] Config --> Network[Network Settings] Config --> Storage[Storage Setup] Secrets --> SOPS[SOPS Encryption] Network --> DNS[DNS Setup] Storage --> CSI[CSI Drivers]
Edit Configuration
cluster:
name: dapper-cluster
domain: example.com
network:
cidr: 10.0.0.0/16
storage:
ceph:
# External Ceph cluster connection
# Configured via Rook operator after bootstrap
monitors: [] # Set during Rook deployment
3. Secrets Management
- Generate age key
- Configure SOPS
- Encrypt sensitive files
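A rough command sequence for the steps above, assuming age-based encryption; the key path, public key, and path regex are placeholders:
# Generate an age key pair
age-keygen -o age.key
# Tell SOPS which files to encrypt and with which public key (.sops.yaml)
cat <<'EOF' > .sops.yaml
creation_rules:
  - path_regex: kubernetes/.*\.sops\.ya?ml
    age: "age1examplepublickeyxxxxxxxxxxxxxxxxxxxxxxxx"
EOF
# Encrypt a secret in place
sops --encrypt --in-place kubernetes/apps/example/secret.sops.yaml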
4. Bootstrap Process
graph TD Start[Start Installation] --> CP[Bootstrap Control Plane] CP --> Workers[Join Worker Nodes] Workers --> GPU[Configure GPU Node] GPU --> Flux[Install Flux] Flux --> Apps[Deploy Apps]
Bootstrap Commands
# Initialize flux
task flux:bootstrap
# Verify installation
task cluster:verify
# Verify GPU support
kubectl get nodes -o wide
nvidia-smi # on GPU node
Post-Installation
1. Verify Components
- Check control plane health
- Verify worker node status
- Test GPU functionality
- Check Rook Ceph connection
- Verify storage classes
- Verify network connectivity
2. Deploy Applications
- Deploy core services
- Configure monitoring
- Setup backup systems
- Deploy GPU-enabled workloads
3. Security Setup
- Configure network policies
- Setup certificate management
- Enable monitoring and alerts
- Secure GPU access
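For the network-policy item, a minimal default-deny ingress policy per namespace is a common starting point (sketch only; the namespace is a placeholder):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace      # apply per namespace
spec:
  podSelector: {}              # selects all pods in the namespace
  policyTypes:
    - Ingress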
Troubleshooting
Common installation issues and solutions:
-
Control Plane Issues
- Verify etcd cluster health
- Check control plane components
- Review system logs
-
Worker Node Issues
- Verify node join process
- Check kubelet status
- Review node logs
-
GPU Node Issues
- Verify NVIDIA driver installation
- Check NVIDIA container runtime
- Validate GPU visibility in cluster
-
Storage Issues
- Verify Ceph cluster connectivity
- Check Rook operator status
- Verify storage class configuration
- Review CephCluster resource health
- Check PV/PVC status
-
Network Problems
- Check DNS resolution
- Verify network policies
- Review ingress configuration
Maintenance
Regular Tasks
- System updates
- Certificate renewal
- Backup verification
- Security audits
- GPU driver updates
Health Checks
- Component status
- Resource usage
- Storage capacity
- Network connectivity
- GPU health
Next Steps
After successful installation:
- Review Architecture Overview
- Configure Storage
- Setup Network
- Deploy Applications
Maintenance Guide
Maintenance Overview
graph TD subgraph Regular Tasks Updates[System Updates] Backups[Backup Tasks] Monitoring[Health Checks] end subgraph Periodic Tasks Audit[Security Audits] Cleanup[Resource Cleanup] Review[Config Review] end Updates --> Verify[Verification] Backups --> Test[Backup Testing] Monitoring --> Alert[Alert Response]
Regular Maintenance
Daily Tasks
- Monitor system health
- Check cluster status
- Review resource usage
- Verify backup completion
- Check alert status
Weekly Tasks
- Review system logs
- Check storage usage
- Verify backup integrity
- Update documentation
Monthly Tasks
- Security updates
- Certificate rotation
- Resource optimization
- Performance review
Update Procedures
Flux Updates
graph LR PR[Pull Request] --> Review[Review Changes] Review --> Test[Test Environment] Test --> Deploy[Deploy to Prod] Deploy --> Monitor[Monitor Status]
Application Updates
- Review release notes
- Test in staging if available
- Update flux manifests
- Monitor deployment
- Verify functionality
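As an example, after bumping a chart version in a HelmRelease manifest, the rollout can be followed with commands along these lines (the kustomization and resource names are placeholders):
# Pull the latest commit and re-run the kustomization
flux reconcile source git flux-system
flux reconcile kustomization apps --with-source
# Watch the HelmRelease and workload roll out
flux get helmreleases -n <namespace> --watch
kubectl -n <namespace> rollout status deployment/<app>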
Backup Management
Backup Strategy
graph TD Apps[Applications] --> Data[Data Backup] Config[Configurations] --> Git[Git Repository] Secrets[Secrets] --> Vault[Secret Storage] Data --> Verify[Verification] Git --> Verify Vault --> Verify
Backup Verification
- Regular restore testing
- Data integrity checks
- Recovery time objectives
- Backup retention policy
Resource Management
Cleanup Procedures
-
Remove unused resources
- Orphaned PVCs
- Completed jobs
- Old backups
- Unused configs
-
Storage optimization
- Compress old logs
- Archive unused data
- Clean container cache
Monitoring and Alerts
Key Metrics
- Node health
- Pod status
- Resource usage
- Storage capacity
- Network performance
Alert Response
- Acknowledge alert
- Assess impact
- Investigate root cause
- Apply fix
- Document resolution
Security Maintenance
Regular Tasks
graph TD Audit[Security Audit] --> Review[Review Findings] Review --> Update[Update Policies] Update --> Test[Test Changes] Test --> Document[Document Changes]
Security Checklist
- Review network policies
- Check certificate expiration
- Audit access controls
- Review secret rotation
- Scan for vulnerabilities
Troubleshooting Guide
Common Issues
-
Node Problems
- Check node status
- Review system logs
- Verify resource usage
- Check connectivity
-
Storage Issues
- Check Ceph cluster health
- Verify CephFS status
- Monitor storage capacity
- Review OSD performance
- Check MDS responsiveness
- Verify PVC mount status
-
Network Problems
- Check DNS resolution
- Verify network policies
- Review ingress status
- Test connectivity
Recovery Procedures
- Node Recovery
# Check node status
kubectl get nodes
# Drain node for maintenance
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
# Perform maintenance
# ...
# Uncordon node
kubectl uncordon node-name
- Storage Recovery
# Check Ceph cluster health
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Check PV status
kubectl get pv
# Check PVC status
kubectl get pvc -A
# Verify storage class
kubectl get sc
# Check CephFS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status
# Check OSD status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
Documentation
Maintenance Logs
- Keep detailed records
- Document changes
- Track issues
- Update procedures
Review Process
- Regular documentation review
- Update procedures
- Verify accuracy
- Add new sections
Best Practices
-
Change Management
- Use git workflow
- Test changes
- Document updates
- Monitor results
-
Resource Management
- Regular cleanup
- Optimize usage
- Monitor trends
- Plan capacity
-
Security
- Regular audits
- Update policies
- Monitor access
- Review logs
Troubleshooting Guide
Diagnostic Workflow
graph TD Issue[Issue Detected] --> Triage[Triage] Triage --> Diagnose[Diagnose] Diagnose --> Fix[Apply Fix] Fix --> Verify[Verify] Verify --> Document[Document]
Common Issues
1. Cluster Health Issues
Node Problems
graph TD Node[Node Issue] --> Check[Check Status] Check --> |Healthy| Resources[Resource Issue] Check --> |Unhealthy| System[System Issue] Resources --> Memory[Memory] Resources --> CPU[CPU] Resources --> Disk[Disk] System --> Logs[Check Logs] System --> Network[Network]
Diagnosis Steps:
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check system resources
kubectl top nodes
kubectl top pods --all-namespaces
# Check system logs
kubectl logs -n kube-system <pod-name>
2. Storage Issues
Volume Problems
graph LR PV[PV Issue] --> Status[Check Status] Status --> |Bound| Access[Access Issue] Status --> |Pending| Provision[Provisioning Issue] Status --> |Failed| Storage[Storage System]
Resolution Steps:
# Check PV/PVC status
kubectl get pv,pvc --all-namespaces
# Check storage class
kubectl get sc
# Check provisioner pods
kubectl get pods -n storage
3. Network Issues
Connectivity Problems
graph TD Net[Network Issue] --> DNS[DNS Check] Net --> Ingress[Ingress Check] Net --> Policy[Network Policy] DNS --> CoreDNS[CoreDNS Pods] Ingress --> Traefik[Traefik Logs] Policy --> Rules[Policy Rules]
Diagnostic Commands:
# Check DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check ingress
kubectl get ingress --all-namespaces
kubectl describe ingress <ingress-name> -n <namespace>
4. Application Issues
Pod Problems
graph TD Pod[Pod Issue] --> Status[Check Status] Status --> |Pending| Schedule[Scheduling] Status --> |CrashLoop| Crash[Container Crash] Status --> |Error| Logs[Check Logs]
Troubleshooting Steps:
# Check pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
Flux Issues
GitOps Troubleshooting
graph TD Flux[Flux Issue] --> Source[Source Controller] Flux --> Kust[Kustomize Controller] Flux --> Helm[Helm Controller] Source --> Git[Git Repository] Kust --> Sync[Sync Status] Helm --> Release[Release Status]
Resolution Steps:
# Check Flux components
flux check
# Check sources
flux get sources git
flux get sources helm
# Check reconciliation
flux get kustomizations
flux get helmreleases
Performance Issues
Resource Constraints
graph LR Perf[Performance] --> CPU[CPU Usage] Perf --> Memory[Memory Usage] Perf --> IO[I/O Usage] CPU --> Limit[Resource Limits] Memory --> Constraint[Memory Constraints] IO --> Bottleneck[I/O Bottleneck]
Analysis Commands:
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Check resource quotas
kubectl get resourcequota -n <namespace>
Recovery Procedures
1. Node Recovery
- Drain node
- Perform maintenance
- Uncordon node
- Verify workloads
2. Storage Recovery
- Backup data
- Fix storage issues
- Restore data
- Verify access
3. Network Recovery
- Check connectivity
- Verify DNS
- Test ingress
- Update policies
Best Practices
1. Logging
- Maintain detailed logs
- Set appropriate retention
- Use structured logging
- Enable audit logging
2. Monitoring
- Set up alerts
- Monitor resources
- Track metrics
- Use dashboards
3. Documentation
- Document issues
- Record solutions
- Update procedures
- Share knowledge
Emergency Procedures
Critical Issues
- Assess impact
- Implement temporary fix
- Plan permanent solution
- Update documentation
Contact Information
- Maintain escalation paths
- Keep contact list updated
- Document response times
- Track incidents
Network Operations Runbook
Overview
This runbook provides step-by-step procedures for common network operations, troubleshooting, and emergency recovery scenarios for the Dapper Cluster network.
Quick Reference:
- Network Topology Documentation
- Brocade Core: 192.168.1.20
- Arista Distribution: 192.168.1.21
- Aruba Access: 192.168.1.26
Table of Contents
- Common Operations
- Troubleshooting Procedures
- Emergency Procedures
- Switch Configuration
- Performance Monitoring
- Maintenance Windows
Common Operations
Accessing Network Equipment
SSH to Switches
Brocade ICX6610:
ssh admin@192.168.1.20
# [TODO: Document default credentials location]
Arista 7050:
ssh admin@192.168.1.21
# [TODO: Document default credentials location]
Aruba S2500-48p:
ssh admin@192.168.1.26
# [TODO: Document default credentials location]
Console Access
When SSH is unavailable:
# [TODO: Document console server or direct serial access]
# Brocade: Serial settings [TODO: baud rate, etc]
# Arista: Serial settings [TODO: baud rate, etc]
Checking Switch Health
Brocade ICX6610
# Basic health check
show version
show chassis
show cpu
show memory
show log tail 50
# Temperature and power
show inline power
show environment
# Check for errors
show logging | include error
show logging | include warn
Arista 7050
# Basic health check
show version
show environment all
show processes top
# Check for errors
show logging last 100
show logging | grep -i error
Verifying VLAN Configuration
Check VLAN Assignments
Brocade:
show vlan
# Check specific VLAN
show vlan 100
show vlan 150
show vlan 200
# Check which ports are in which VLANs
show vlan ethernet 1/1/1
Arista:
show vlan
# Check VLAN details
show vlan id 200
# Show interfaces by VLAN
show interfaces status
Verify Trunk Ports
Brocade:
# Show trunk configuration
show interface brief | include Trunk
# Show specific trunk
show interface ethernet 1/1/41
show interface ethernet 1/1/42
Arista:
# Show trunk ports
show interface trunk
# Show specific interface
show interface ethernet 49
show interface ethernet 50
Checking Link Aggregation (LAG) Status
Brocade LAG Status
# Show all LAG groups
show lag brief
# Show specific LAG details
show lag [lag-id]
# Show which ports are in LAG
show lag | include active
# Check individual LAG port status
show interface ethernet 1/1/41
show interface ethernet 1/1/42
Expected Output When Working:
LAG "brocade-to-arista" (lag-id [X]) has 2 active ports:
ethernet 1/1/41 (40Gb) - Active
ethernet 1/1/42 (40Gb) - Active
Arista Port-Channel Status
# Show port-channel summary
show port-channel summary
# Show specific port-channel
show interface port-channel 1
# Check member interfaces
show interface ethernet 49 port-channel
show interface ethernet 50 port-channel
Expected Output When Working:
Port-Channel1:
Active Ports: 2
Et49: Active
Et50: Active
Protocol: LACP
Monitoring Traffic and Bandwidth
Real-Time Interface Statistics
Brocade:
# Show interface rates
show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate
# Show all interface statistics
show interface ethernet 1/1/41
# Monitor in real-time (if supported)
monitor interface ethernet 1/1/41
Arista:
# Show interface counters
show interface ethernet 49 counters
# Show interface rates
show interface ethernet 49 | grep rate
# Real-time monitoring
watch 1 show interface ethernet 49 counters rate
Identify Top Talkers
Brocade:
# [TODO: Document method to identify top talkers]
# May require SNMP monitoring or sFlow
Arista:
# Check interface utilization
show interface counters utilization
# If sFlow configured:
# [TODO: Document sFlow commands]
Testing Connectivity
From Your Workstation
Test Management Plane:
# Ping all management interfaces
ping -c 4 192.168.1.20 # Brocade
ping -c 4 192.168.1.21 # Arista
ping -c 4 192.168.1.26 # Aruba
ping -c 4 192.168.1.7 # Mikrotik House
ping -c 4 192.168.1.8 # Mikrotik Shop
# Test wireless bridge latency
ping -c 100 192.168.1.8 | tail -3
Test Server Network (VLAN 100):
# Test Kubernetes nodes
ping -c 4 10.100.0.40 # K8s VIP
ping -c 4 10.100.0.50 # talos-control-1
ping -c 4 10.100.0.51 # talos-control-2
ping -c 4 10.100.0.52 # talos-control-3
Test from Kubernetes Nodes:
# SSH to a Talos node (if enabled) or use kubectl exec
kubectl exec -it -n default <pod-name> -- sh
# Test connectivity
ping 10.150.0.10 # Storage network
ping 10.100.0.1 # Gateway
ping 8.8.8.8 # Internet
MTU Testing (Jumbo Frames)
Test VLAN 150/200 MTU 9000:
# From a host on VLAN 150
ping -M do -s 8972 10.150.0.10
# -M do: Don't fragment
# -s 8972: 8972 + 28 (IP+ICMP headers) = 9000
# If this fails but smaller packets work, MTU is misconfigured
Path Testing
Trace route across networks:
# From your workstation
traceroute 10.100.0.50
# Expected path (if everything is working):
# 1. Local gateway
# 2. Wireless bridge
# 3. Brocade/OPNsense
# 4. Destination
Troubleshooting Procedures
Issue: No Connectivity to Garage Switches
Symptoms:
- Cannot ping/SSH to Brocade (192.168.1.20) or Arista (192.168.1.21)
- Can ping Aruba switch (192.168.1.26)
Diagnosis:
-
Test wireless bridge:
ping 192.168.1.7  # Mikrotik House
ping 192.168.1.8  # Mikrotik Shop
- If 192.168.1.7 responds but 192.168.1.8 doesn't: Wireless link down
- If neither respond: Mikrotik issue or config problem
-
Check Aruba-to-Mikrotik connection:
# SSH to Aruba
ssh admin@192.168.1.26
# Check port status for Mikrotik connection
show interface [TODO: port ID]
Resolution:
If wireless bridge is down:
- Check Mikrotik radios web interface (192.168.1.7, 192.168.1.8)
- Check alignment and signal strength
- Verify power to both radios
- Check for interference (weather, obstacles)
- Emergency: Use physical console access to switches in garage
If Mikrotik is up but switches unreachable:
- Check VLAN 1 configuration on trunk ports
- Verify Mikrotik is not blocking traffic
- Check Brocade port connected to Mikrotik is up
Issue: Kubernetes Pods Can't Access Storage
Symptoms:
- Pods stuck in ContainerCreating
- PVC stuck in Pending
- Errors about unable to mount CephFS
Diagnosis:
-
Check Rook/Ceph health:
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
-
Check network connectivity from Kubernetes nodes to Ceph monitors:
# From a Talos node or debug pod
ping 10.150.0.10  # Test VLAN 150 connectivity
# Test Ceph monitor port
nc -zv <monitor-ip> 6789
-
Verify VLAN 150 MTU:
# Test jumbo frames
ping -M do -s 8972 10.150.0.10
-
Check CSI driver logs:
kubectl -n rook-ceph logs -l app=csi-cephfsplugin --tail=100
Resolution:
If MTU mismatch:
- Verify MTU 9000 on all VLAN 150 interfaces
- Check Proxmox bridge MTU settings
- Check switch port MTU configuration
If connectivity issue:
- Check VLAN 150 is properly tagged on trunk ports
- Verify Proxmox host network configuration
- Check Brocade routing for VLAN 150
Issue: Slow Ceph Performance
Symptoms:
- Slow pod startup times
- High I/O latency in applications
- Ceph health warnings about slow ops
Diagnosis:
-
Check Ceph cluster health:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
-
Check network bandwidth utilization:
On Brocade (VLAN 150 - Ceph Public):
# Check 10Gb bonds to Proxmox hosts
show interface ethernet 1/1/[TODO: ports] | include rate
On Arista (VLAN 200 - Ceph Cluster):
# Check 40Gb links to Proxmox hosts
show interface ethernet [TODO: ports] counters rate
-
Identify bottlenecks:
- Are 10Gb links saturated? (VLAN 150)
- Are 40Gb links saturated? (VLAN 200)
- Is the Brocade-Arista link saturated?
Resolution:
If Brocade-Arista link is bottleneck:
- Primary Issue: Only one 40Gb link active (see below to enable second link)
- Enabling second 40Gb link will double bandwidth to 80Gbps
If MTU not configured:
- Verify MTU 9000 on VLAN 150 and 200
- Check each hop in the path
If switch CPU is high:
- Check for broadcast storms
- Verify STP is working correctly
- Look for loops in topology
Issue: Network Loop / Broadcast Storm
Symptoms:
- Network performance severely degraded
- High CPU usage on switches
- Connectivity flapping
- Massive packet rates on interfaces
Diagnosis:
-
Check for duplicate MAC addresses:
# Brocade
show mac-address
# Look for same MAC on multiple ports
-
Check STP status:
# Brocade
show spanning-tree
# Arista
show spanning-tree
-
Look for physical loops:
- Review physical topology diagram
- Check for accidental double connections
- Known issue: Brocade-Arista 2x 40Gb links not in LAG
Resolution:
Immediate (Emergency):
-
Disable one link causing loop:
# On Arista (already done in current config)
configure
interface ethernet 50
shutdown
-
Verify spanning-tree is enabled:
# Brocade
show spanning-tree
# If not enabled:
configure terminal
spanning-tree
Permanent Fix:
- Configure proper LAG/port-channel (see section below)
Issue: Proxmox Host Loses Network Connectivity
Symptoms:
- Cannot ping Proxmox host management IP
- VMs on host also offline
- IPMI still accessible
Diagnosis:
-
Access via IPMI console:
# [TODO: Document IPMI access method]
-
Check bond status on Proxmox:
# From Proxmox console
ip link show
# Check bond interfaces
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1
-
Check switch ports:
# On Brocade
show interface ethernet 1/1/[TODO: ports for this host]
show lag [TODO: lag-id for this host]
Resolution:
If bond is down on Proxmox:
- Check physical cables
- Restart networking on Proxmox (WARNING: will disrupt VMs)
- Check switch port status
If ports down on switch:
- Check for error counters
- Re-enable port if administratively down
- Check for physical issues (SFP, cable)
Issue: High Latency Across Wireless Bridge
Symptoms:
- Ping times to garage > 10ms (normally 1-2ms)
- Slow access to services in garage
- Packet loss
Diagnosis:
-
Test latency:
ping -c 100 192.168.1.8
# Look at:
# - Average latency
# - Packet loss %
# - Jitter (variation)
-
Check Mikrotik radio status:
- Access web interface: 192.168.1.7 and 192.168.1.8
- Check signal strength
- Check throughput/bandwidth utilization
- Look for interference
-
Test with iperf:
# On server side (garage)
iperf3 -s
# On client side (house)
iperf3 -c 192.168.1.8 -t 30
# Should see ~1 Gbps
Resolution:
If signal degraded:
- Check for obstructions (trees, weather)
- Check alignment
- Check for interference sources
- Consider backup link or failover
If bandwidth saturated:
- Identify high-bandwidth users/applications
- Implement QoS if available
- Consider upgrade to higher bandwidth link
Emergency Procedures
Complete Network Outage (Wireless Bridge Down)
Impact:
- No remote access to garage infrastructure
- Kubernetes cluster still functions internally
- No internet access from garage
- Management access requires physical presence
Emergency Access Methods:
-
Physical console access:
# [TODO: Document where console cables are stored]
# Connect laptop directly to switch console port
-
IPMI access (if VPN or alternative route exists):
# [TODO: Document IPMI network topology]
Restoration Steps:
-
Check Mikrotik radios:
- Physical inspection of both radios
- Power cycle if needed
- Check alignment
-
Temporary workaround:
- [TODO: Document backup connectivity method]
- VPN tunnel over alternative route?
- Temporary cable run?
-
Verify restoration:
ping 192.168.1.8
ping 192.168.1.20
ssh admin@192.168.1.20
Core Switch (Brocade) Failure
Impact:
- Loss of VLAN 150/200 routing
- Kubernetes cluster degraded (storage issues)
- Loss of 10Gb connectivity to Proxmox hosts
Emergency Actions:
-
Do NOT reboot all Proxmox hosts simultaneously
- Cluster may be operational on running workloads
- Storage connections via VLAN 200 through Arista may still work
-
Check Brocade status:
- Physical inspection (power, fans, LEDs)
- Console access
- Review logs
-
If Brocade must be replaced:
- [TODO: Document backup configuration location]
- [TODO: Document restoration procedure]
- [TODO: Document spare hardware location]
Spanning Tree Failure / Network Loop
Impact:
- Network completely unusable
- High CPU on all switches
- Broadcast storm
Emergency Actions:
-
Disconnect Brocade-Arista links:
# On Arista (fastest access if SSH still works)
configure
interface ethernet 49
shutdown
interface ethernet 50
shutdown
-
Or physically disconnect:
- Unplug both 40Gb QSFP+ cables between Brocade and Arista
-
Wait for network to stabilize (30-60 seconds)
-
Reconnect ONE link only:
# On Arista
configure
interface ethernet 49
no shutdown
-
Verify stability before enabling second link
Accidental Configuration Change
Symptoms:
- Network suddenly degraded after change
- New errors appearing
- Connectivity loss
Emergency Actions:
-
Rollback configuration:
Brocade:
# Show configuration history
show configuration
# Revert to previous config
# [TODO: Document Brocade config rollback method]
Arista:
# Show rollback options
show configuration sessions
# Rollback to previous configure session
rollback <session-name>
-
If rollback not available:
- Reboot switch (loads startup-config)
- WARNING: Brief outage during reboot
Switch Configuration
Configure Brocade-Arista LAG (Fix Loop Issue)
Prerequisites:
- Maintenance window scheduled
- Both 40Gb QSFP+ cables connected and working
- Console access to both switches available
- Configuration backed up
Step 1: Pre-Change Verification
# Verify current state
# On Brocade:
show interface ethernet 1/1/41
show interface ethernet 1/1/42
# On Arista:
show interface ethernet 49
show interface ethernet 50 # Currently disabled
# Document current traffic levels
show interface ethernet 1/1/41 | include rate
Step 2: Configure Brocade LAG
# SSH to Brocade
ssh admin@192.168.1.20
# Enter configuration mode
enable
configure terminal
# Create LAG
lag brocade-to-arista dynamic id [TODO: Choose available LAG ID, e.g., 10]
ports ethernet 1/1/41 to 1/1/42
primary-port 1/1/41
lacp-timeout short
deploy
exit
# Configure VLAN on LAG
vlan 1
tagged lag [LAG-ID]
exit
vlan 100
tagged lag [LAG-ID]
exit
vlan 150
tagged lag [LAG-ID]
exit
vlan 200
tagged lag [LAG-ID]
exit
# Apply to interfaces
interface ethernet 1/1/41
link-aggregate active
exit
interface ethernet 1/1/42
link-aggregate active
exit
# Save configuration
write memory
# Verify
show lag brief
show lag [LAG-ID]
Step 3: Configure Arista Port-Channel
# SSH to Arista
ssh admin@192.168.1.21
# Enter configuration mode
enable
configure
# Create port-channel
interface Port-Channel1
description Link to Brocade ICX6610
switchport mode trunk
switchport trunk allowed vlan 1,100,150,200
exit
# Add member interfaces
interface Ethernet49
description Brocade 40G Link 1
channel-group 1 mode active
lacp rate fast
exit
interface Ethernet50
description Brocade 40G Link 2
channel-group 1 mode active
lacp rate fast
exit
# Save configuration
write memory
# Verify
show port-channel summary
show interface Port-Channel1
show lacp neighbor
Step 4: Verify Configuration
# On Brocade:
show lag [LAG-ID]
# Should show: 2 ports active
show lacp
# Should show: Negotiated with neighbor
# On Arista:
show port-channel summary
# Should show: Po1(U) with Et49(P), Et50(P)
show lacp neighbor
# Should show: Brocade as partner
# Test traffic balancing
show interface Port-Channel1 counters
show interface ethernet 49 counters
show interface ethernet 50 counters
# Both Et49 and Et50 should show traffic
Step 5: Monitor for Issues
# Watch for 15 minutes
# On Arista:
watch 10 show port-channel summary
# Check for errors
show logging | grep -i Port-Channel1
# Monitor CPU (should be normal)
show processes top
Rollback Plan (if issues occur):
# On Arista (fastest to disable)
configure
interface ethernet 50
shutdown
# On Brocade (if needed)
configure terminal
no lag brocade-to-arista
interface ethernet 1/1/41
no link-aggregate
interface ethernet 1/1/42
no link-aggregate
Adding a New VLAN
Example: Adding VLAN 300 for IoT devices
Step 1: Plan VLAN
- VLAN ID: 300
- Network: [TODO: e.g., 10.30.0.0/24]
- Gateway: [TODO: Which device?]
- Required on: [TODO: Which switches/trunks?]
Step 2: Create VLAN on Brocade
ssh admin@192.168.1.20
enable
configure terminal
# Create VLAN
vlan 300
name IoT-Network
tagged ethernet 1/1/41 to 1/1/42 # Trunk to Arista
tagged ethernet 1/1/[TODO] # Trunk to Mikrotik
untagged ethernet 1/1/[TODO] # Access ports if needed
exit
# If Brocade is gateway:
interface ve 300
ip address [TODO: IP]/24
exit
# Save
write memory
Step 3: Add to other switches as needed
# On Arista:
configure
vlan 300
name IoT-Network
exit
interface Port-Channel1
switchport trunk allowed vlan add 300
exit
write memory
Configuring Jumbo Frames (MTU 9000)
For VLAN 150 and 200 (Ceph networks)
On Brocade:
# Configure MTU on VLAN interfaces
interface ve 150
mtu 9000
exit
interface ve 200
mtu 9000
exit
# Configure MTU on physical/LAG interfaces
interface ethernet 1/1/[TODO: storage network ports]
mtu 9000
exit
write memory
On Arista:
# Configure MTU on interfaces carrying VLAN 150/200
interface ethernet [TODO: ports]
mtu 9216 # 9000 + overhead
exit
write memory
Verify MTU:
# From Talos node
ping -M do -s 8972 10.150.0.10
# Should succeed without fragmentation
Performance Monitoring
Key Metrics to Monitor
Switch Health:
- CPU utilization (should be <30% normally)
- Memory utilization (should be <70%)
- Temperature (within operating range)
- Power supply status
Interface Health:
- Error counters (input/output errors)
- CRC errors
- Interface resets
- Utilization percentage
Traffic Patterns:
- Bandwidth utilization per interface
- Top talkers per VLAN
- Broadcast/multicast rates
Setting Up Monitoring
[TODO: Document monitoring setup]
Options:
- SNMP monitoring to Prometheus
- sFlow for traffic analysis
- Switch logging to Loki
- Grafana dashboards
Example Prometheus Targets:
# [TODO: Example prometheus config for SNMP exporter]
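As a starting point until the TODO above is filled in, a typical snmp_exporter scrape job looks roughly like this; the module name and exporter address are placeholders to adapt once the exporter is deployed:
scrape_configs:
  - job_name: "snmp-switches"
    metrics_path: /snmp
    params:
      module: [if_mib]            # generic interface metrics module
    static_configs:
      - targets:
          - 192.168.1.20          # Brocade
          - 192.168.1.21          # Arista
          - 192.168.1.26          # Aruba
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter.observability.svc:9116   # placeholder exporter address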
Baseline Performance Metrics
Normal Operating Conditions:
Metric | Expected Value | Alert Threshold |
---|---|---|
Wireless Bridge Latency | 1-2ms | > 5ms |
Wireless Bridge Loss | 0% | > 1% |
Brocade CPU | < 20% | > 60% |
Arista CPU | < 15% | > 50% |
40Gb Link Utilization | < 50% | > 80% |
10Gb Link Utilization | < 60% | > 85% |
[TODO: Add baseline measurements]
Maintenance Windows
Pre-Maintenance Checklist
Before any network maintenance:
- Schedule maintenance window
- Notify all users
-
Back up switch configurations
# Brocade
show running-config > backup-$(date +%Y%m%d).cfg
# Arista
show running-config > backup-$(date +%Y%m%d).cfg
- Document current state
- Have rollback plan ready
- Ensure console access available
- Test backup connectivity method
Post-Maintenance Checklist
After any network maintenance:
-
Verify all links are up
show interface brief
# Brocade
show lag brief
# Arista
show port-channel summary
-
Check for errors
show logging | include error
- Test connectivity to all VLANs
- Monitor for 30 minutes for issues
- Update documentation with any changes
-
Save configurations
write memory
Regular Maintenance Tasks
Weekly:
- Review switch logs for errors/warnings
- Check interface error counters
- Verify wireless bridge performance
Monthly:
- Review bandwidth utilization trends
- Check for firmware updates
- Verify backup configurations are current
Quarterly:
- Review and update network documentation
- Test emergency procedures
- Review and optimize switch configurations
Configuration Backup
Backing Up Switch Configurations
Brocade ICX6610:
# Method 1: Copy to TFTP server
copy running-config tftp [TODO: TFTP server IP] brocade-backup-$(date +%Y%m%d).cfg
# Method 2: Display and save manually
show running-config > /tmp/brocade-config.txt
# [TODO: Document automated backup method]
Arista 7050:
# Show running config
show running-config
# Copy to USB (if available)
copy running-config usb:/arista-backup-$(date +%Y%m%d).cfg
# [TODO: Document automated backup method]
Storage Location:
- [TODO: Document where configurations are backed up]
- Consider: Git repository for version control
- Consider: Automated daily backups via Ansible
Restoring Configurations
Brocade:
# Load config from file
copy tftp running-config [TFTP-IP] [filename]
# Or manually paste config
configure terminal
# Paste configuration
Arista:
# Copy config from file
copy usb:/backup.cfg running-config
# Or configure manually
configure
# Paste configuration
Security Considerations
Access Control
[TODO: Document security policies]
- Who has access to switch management?
- How are credentials managed?
- Is 2FA available/configured?
- Are management VLANs isolated?
Security Best Practices
- Change default passwords
- Disable unused ports
- Enable port security where appropriate
- Configure DHCP snooping
- Enable storm control
- Regular firmware updates
- Monitor for unauthorized devices
Useful Commands Reference
Brocade ICX6610 Quick Reference
# Basic show commands
show version
show running-config
show interface brief
show vlan
show lag
show mac-address
show spanning-tree
show log
# Interface management
interface ethernet 1/1/1
enable
disable
description [text]
# Save configuration
write memory
Arista 7050 Quick Reference
# Basic show commands
show version
show running-config
show interfaces status
show vlan
show port-channel summary
show mac address-table
show spanning-tree
show logging
# Interface management
configure
interface ethernet 1
shutdown
no shutdown
description [text]
# Save configuration
write memory
Contacts and Escalation
[TODO: Fill in contact information]
Role | Name | Contact | Escalation Level |
---|---|---|---|
Primary Network Admin | [TODO] | [TODO] | 1 |
Secondary Contact | [TODO] | [TODO] | 2 |
Vendor Support - Brocade | [TODO] | [TODO] | 3 |
Vendor Support - Arista | [TODO] | [TODO] | 3 |
Change Log
Date | Change | Person | Impact | Notes |
---|---|---|---|---|
2025-10-14 | Initial runbook created | Claude | None | Baseline documentation |
[TODO] | [TODO] | [TODO] | [TODO] | [TODO] |
References
- Network Topology Documentation
- Storage Architecture - For Ceph network details
- Brocade ICX6610 Documentation: [TODO: Link]
- Arista 7050 Documentation: [TODO: Link]
- [TODO: Add other relevant documentation links]
40GB Ceph Storage Network Configuration
Project Overview
This project configures the 40GB network infrastructure for Ceph cluster storage (VLAN 200), providing dedicated high-bandwidth links for Ceph OSD replication traffic. The configuration enables 40Gbps connectivity for three Proxmox hosts with routing support for the fourth host through a Brocade-Arista link aggregation.
Key Benefits:
- 🚀 4x bandwidth increase for hosts with 40Gb links (10Gb → 40Gb)
- 🔗 80Gbps aggregated bandwidth between Brocade and Arista switches
- 📦 Jumbo frame support (MTU 9000) for improved Ceph performance
- 🔀 No bottlenecks for mixed-speed traffic (10Gb and 40Gb hosts)
- 🛡️ Resilient design with LACP link aggregation and failover
Project Status: ✅ Ready for Implementation
Architecture Summary
Current State (Before Configuration)
VLAN 200 Traffic:
- All Proxmox hosts using 2x 10Gb bonds to Brocade
- Shared bandwidth with VLAN 100, 150
- MTU 1500 (no jumbo frames)
- Brocade-Arista: Only 1x 40Gb link active (2nd disabled due to loop)
Limitations:
- Ceph replication limited to ~9-10 Gbps per host
- Contention with other VLAN traffic
- No jumbo frame support
Target State (After Configuration)
VLAN 200 Traffic:
- Proxmox-01, 02, 04: Dedicated 40Gb links to Arista
- Proxmox-03: 10Gb bond to Brocade (no 40Gb link available)
- Brocade-Arista: 2x 40Gb LAG (80Gbps total)
- MTU 9000 throughout the path
- Layer 3 routing on Brocade for Proxmox-03
Benefits:
- Proxmox-01, 02, 04: 40 Gbps dedicated Ceph bandwidth
- Proxmox-03: 10 Gbps Ceph bandwidth (no bottleneck at switches)
- Direct switching for 40Gb hosts (no routing latency)
- Optimized for large Ceph object transfers (jumbo frames)
Network Topology
Physical Connections
┌─────────────┐
│ Proxmox-01 │───────40Gb───────┐
│ (10.200.0.1)│ │
└─────────────┘ │
│
┌─────────────┐ │
│ Proxmox-02 │───────40Gb───────┤
│ (10.200.0.2)│ │
└─────────────┘ │ ┌──────────────┐
├────│ Arista 7050 │
┌─────────────┐ │ │ │
│ Proxmox-04 │───────40Gb───────┤ │ Et27: Px-01 │
│ (10.200.0.4)│ │ │ Et28: Px-02 │
└─────────────┘ │ │ Et29: Px-04 │
│ └──────┬───────┘
│ │
┌─────────────┐ │ 2x 40Gb LAG
│ Proxmox-03 │─┐ │ (Port-Channel1)
│ (10.200.0.3)│ │ │ │
└─────────────┘ │ │ ┌──────┴───────┐
│ │ │ Brocade 6610 │
2x 10Gb bond │ │ │
(LAG to Brocade) │ │ LAG 11: 80Gb │
│ │ │ VE200: GW │
└────────────────┘ └──────────────┘
(10.200.0.254)
Traffic Paths
Direct 40Gb Paths (no routing):
- Proxmox-01 ↔ Proxmox-02: Via Arista (switched)
- Proxmox-01 ↔ Proxmox-04: Via Arista (switched)
- Proxmox-02 ↔ Proxmox-04: Via Arista (switched)
Routed Paths (through Brocade):
- Proxmox-03 ↔ Proxmox-01/02/04:
- Px-03 → 10Gb bond → Brocade VE200 (Layer 3 routing)
- Brocade → 80Gb LAG → Arista
- Arista → 40Gb → Px-01/02/04
- Bandwidth: Limited to 10Gb at Px-03, but LAG prevents bottleneck
Configuration Phases
This project is divided into 6 phases that must be completed in order:
Phase 1: Configure Brocade-Arista 40GB LAG ⚙️
Enable the second 40Gb link between Brocade and Arista by configuring LACP link aggregation.
Key Tasks:
- Create LAG on Brocade (LAG ID 11)
- Create Port-Channel on Arista (Port-Channel1)
- Enable both 40Gb links (Et25, Et26 on Arista)
- Configure LACP with fast timeout
Outcome: 80Gbps bandwidth between switches, eliminating potential bottleneck.
Phase 2: Configure VLAN 200 Routing on Brocade 🌐
Configure Layer 3 routing on Brocade for Proxmox-03 to reach other hosts via VLAN 200.
Key Tasks:
- Create VLAN interface VE200 on Brocade (10.200.0.254/24)
- Set MTU 9000 on VE200
- Configure default route on Proxmox-03 pointing to Brocade
Outcome: Proxmox-03 can route to all other hosts on VLAN 200 through Brocade.
Phase 3: Configure Arista VLAN 200 with Jumbo Frames 📦
Enable jumbo frame support (MTU 9216) on all Arista interfaces carrying VLAN 200 traffic.
Key Tasks:
- Set MTU 9216 on Port-Channel1 (Brocade link)
- Set MTU 9216 on Et27, Et28, Et29 (Proxmox links)
- Verify VLAN 200 trunk configuration
Outcome: Arista switch supports jumbo frames for optimal Ceph performance.
Phase 4: Identify 40GB NICs on Proxmox Hosts 🔍
Identify which physical network interfaces are the 40Gb NICs on each Proxmox host and map them to Arista switch ports.
Key Tasks:
- Use ethtool/LLDP/link flap to identify 40Gb interfaces
- Document interface names (e.g., enp2s0, enp7s0)
- Map Proxmox hosts to Arista ports (Et27, Et28, Et29)
- Record MAC addresses for tracking
Outcome: Complete mapping of Proxmox hosts to Arista ports with interface names documented.
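A few commands that typically help with this identification (interface names are examples; lldpctl requires lldpd to be installed):
# List interfaces and link state
ip -br link
# Check negotiated speed on a candidate interface
ethtool enp2s0 | grep -E "Speed|Link detected"
# If lldpd is installed, see which switch port each NIC is patched to
lldpctl
# Last resort: flap the link and watch the Arista port status
ip link set enp2s0 down && sleep 5 && ip link set enp2s0 up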
Phase 5: Reconfigure Proxmox Hosts to Use 40GB Links 🖥️
Reconfigure Proxmox hosts to move VLAN 200 from 10Gb bonds to dedicated 40Gb interfaces.
Key Tasks:
- Edit /etc/network/interfaces on each host
- Move vmbr200 from bond1 to 40Gb interface (with VLAN 200 tag)
- Set MTU 9000 on all interfaces
- Test connectivity after each host (one at a time!)
Outcome: Proxmox-01, 02, 04 using 40Gb links; Proxmox-03 using 10Gb bond with routing.
⚠️ Important: Configure hosts one at a time to minimize Ceph disruption!
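A rough sketch of the target /etc/network/interfaces stanzas for one of the 40Gb hosts (Proxmox-01 shown; interface names come from Phase 4 and may differ):
# 40Gb physical interface
auto enp2s0
iface enp2s0 inet manual
    mtu 9000
# VLAN 200 sub-interface for Ceph cluster traffic
auto enp2s0.200
iface enp2s0.200 inet manual
    vlan-raw-device enp2s0
    mtu 9000
# Bridge carrying the VLAN 200 address
auto vmbr200
iface vmbr200 inet static
    address 10.200.0.1/24
    bridge-ports enp2s0.200
    bridge-stp off
    bridge-fd 0
    mtu 9000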
Phase 6: Testing and Verification ✅
Comprehensive testing to validate the configuration and measure performance improvements.
Test Categories:
- Connectivity Tests: Ping between all hosts, ARP resolution
- MTU Tests: Jumbo frame validation (ping -M do -s 8972)
- Bandwidth Tests: iperf3 throughput measurements
- Ceph Performance: OSD network tests, rebalance speed
- Switch Verification: LAG status, traffic distribution, error counters
- Failover Tests: LAG member failure, host failure scenarios
Outcome: Validated 4x performance improvement with comprehensive test results.
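A compact verification pass that can be run from each host after cutover, using the addresses from the IP addressing table in the Configuration Reference below (start iperf3 -s on the peer first):
# Connectivity and jumbo frames to every peer on VLAN 200
for ip in 10.200.0.1 10.200.0.2 10.200.0.3 10.200.0.4; do
  ping -c 3 "$ip"
  ping -c 3 -M do -s 8972 "$ip"    # must not fragment
done
# Bandwidth test against one peer
iperf3 -c 10.200.0.2 -P 4 -t 30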
Quick Start Guide
Prerequisites
Before starting, ensure you have:
- SSH access to all switches (Brocade, Arista)
- SSH/IPMI access to all Proxmox hosts
- Backup of all current configurations
- Maintenance window scheduled (recommended: 2-4 hours)
- Console access ready (IPMI) in case of network issues
- This documentation downloaded and available offline
Execution Order
Follow phases in strict order:
- Phase 1 (30 min): Configure switch LAG - minimal disruption
- Phase 2 (15 min): Add Brocade routing - no disruption to existing hosts
- Phase 3 (15 min): Set MTU on Arista - minimal disruption
- Phase 4 (30 min): Identify interfaces - read-only, no disruption
- Phase 5 (60 min): Reconfigure hosts - brief Ceph disruption per host
- Phase 6 (60+ min): Testing - monitor for issues
Total time: 3-4 hours (including testing)
Rollback Strategy
Each phase includes a rollback procedure. Key rollback points:
- Phase 1: Disable 2nd 40Gb link on Arista (immediate)
- Phase 2: Disable VE200 on Brocade (immediate)
- Phase 3: Revert MTU on Arista (immediate)
- Phase 5: Restore /etc/network/interfaces backup on each host
Critical: Keep one SSH session open to each device before making changes!
Configuration Reference
IP Addressing (VLAN 200)
Host | Management IP | VLAN 200 IP | Interface | Link Speed | Connected To |
---|---|---|---|---|---|
Proxmox-01 | 192.168.1.62 | 10.200.0.1/24 | enp2s0.200 | 40Gb | Arista Et27 |
Proxmox-02 | 192.168.1.63 | 10.200.0.2/24 | enp7s0.200 | 40Gb | Arista Et28 |
Proxmox-03 | 192.168.1.64 | 10.200.0.3/24 | bond1.200 | 2x 10Gb | Brocade LAG 2 |
Proxmox-04 | 192.168.1.66 | 10.200.0.4/24 | enp7s0.200 | 40Gb | Arista Et29 |
Brocade | 192.168.1.20 | 10.200.0.254/24 | VE200 | - | Gateway |
Note: Interface names may vary - update based on Phase 4 findings.
Switch Configuration Summary
Brocade ICX6610:
- LAG 11: "brocade-to-arista" (ports 1/2/1, 1/2/6)
- LACP dynamic, short timeout
- VLANs 1, 20, 100, 150, 200 tagged
- MTU 9000
- VLAN 200:
- VE200: 10.200.0.254/24, MTU 9000
- Routing enabled
Arista 7050:
- Port-Channel1: (Et25, Et26)
- LACP active, fast rate
- Trunk: VLANs 1, 20, 100, 150, 200
- MTU 9216
- Et27, Et28, Et29:
- Trunk mode
- VLAN 200 tagged
- MTU 9216
MTU Configuration
Device | Interface | MTU | Notes |
---|---|---|---|
Brocade | VE200 | 9000 | VLAN 200 gateway |
Brocade | LAG 11 | 9000 | To Arista |
Arista | Port-Channel1 | 9216 | From Brocade |
Arista | Et27-29 | 9216 | To Proxmox |
Proxmox | 40Gb NIC | 9000 | Physical interface |
Proxmox | VLAN interface | 9000 | enp*.200 |
Proxmox | vmbr200 | 9000 | Bridge |
Why 9216 on Arista? Arista interface MTU is specified at Layer 2, so the extra 216 bytes of headroom cover Ethernet, VLAN, and other encapsulation headers while still passing 9000-byte IP payloads end to end.
Expected Performance
Bandwidth Improvements
Connection | Before (10Gb) | After (40Gb) | Improvement |
---|---|---|---|
Px-01 ↔ Px-02 | ~9 Gbps | ~35-38 Gbps | 4.0x |
Px-01 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
Px-02 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
Px-03 ↔ Others | ~9 Gbps | ~9-10 Gbps | 1.0x* |
* Proxmox-03 limited by 10Gb uplink, but no switch bottleneck due to 80Gb LAG.
Latency Improvements
Path | Expected RTT | Notes |
---|---|---|
40Gb direct | < 0.5 ms | Switched only (no routing) |
Via Brocade | < 1.5 ms | One Layer 3 hop |
Before (10Gb shared) | 1-2 ms | Shared bandwidth, potential queuing |
Ceph Performance
Expected improvements:
- Rebalance speed: 4x faster on hosts with 40Gb links
- OSD recovery: Significantly reduced time for large objects
- Client I/O: Reduced latency for Kubernetes pods using RBD/CephFS
- Concurrent operations: Better performance under load
Troubleshooting
Common Issues
Issue: LAG/Port-Channel won't form
- Check: LACP mode (both sides "active"), VLAN configuration matches
- Verify: Physical links are up, no cable/SFP issues
- Ref: Phase 1 Troubleshooting
Issue: Jumbo frames don't work
- Check: MTU on every hop (host → switch → switch → host)
- Test: ping -M do -s 8972 <destination>
- Ref: Phase 3 Troubleshooting
Issue: Proxmox-03 can't reach other hosts
- Check: VE200 is up, routing table on Px-03, gateway configured
- Verify: LAG 11 is up between Brocade and Arista
- Ref: Phase 2 Troubleshooting
Issue: Lower than expected bandwidth
- Check: CPU usage during iperf3, network card offloading settings
- Test: Multi-stream iperf3 (-P 10)
- Ref: Phase 6 Troubleshooting
Emergency Contacts
Role | Responsibility | Contact |
---|---|---|
Network Admin | Switch configuration | [FILL IN] |
System Admin | Proxmox hosts | [FILL IN] |
Storage Admin | Ceph cluster | [FILL IN] |
Maintenance and Operations
Regular Checks
Weekly:
- Monitor switch logs for errors
- Check LAG/Port-Channel status
- Review interface error counters
Monthly:
- Verify bandwidth utilization trends
- Test failover procedures
- Review and update documentation
Quarterly:
- Full connectivity and performance test
- Review configurations for optimization
- Plan for firmware updates
Backup Procedures
Switch Configurations:
# Brocade
show running-config > brocade-backup-$(date +%Y%m%d).txt
# Arista
show running-config > arista-backup-$(date +%Y%m%d).txt
Proxmox Network Configs:
# On each host
cp /etc/network/interfaces /etc/network/interfaces.backup-$(date +%Y%m%d)
Store backups in: /home/derek/projects/dapper-cluster/docs/src/operations/network-configs/backups/
Future Enhancements
Potential Improvements:
- Add 40Gb link to Proxmox-03 (hardware upgrade)
- Implement network monitoring with Prometheus SNMP exporter
- Configure sFlow for traffic analysis
- Set up automated configuration backups (Ansible)
- Add redundant Brocade-Arista links (if needed)
- Upgrade to 100Gb links (future-proofing)
Success Metrics
Technical Metrics
- ✅ Bandwidth: 40Gb links achieve 35+ Gbps throughput
- ✅ Latency: < 1ms RTT between hosts on Arista
- ✅ MTU: 9000 byte frames work end-to-end
- ✅ Availability: 99.9%+ uptime (LAG provides redundancy)
- ✅ Ceph: Rebalance speed increased by 4x
Operational Metrics
- ✅ Deployment Time: < 4 hours total
- ✅ Downtime: < 5 minutes per host (during reconfiguration)
- ✅ Documentation: Complete and accurate
- ✅ Rollback: < 2 minutes to revert changes
- ✅ Team Readiness: All staff trained on new configuration
References
Internal Documentation
Phase Documentation
- Phase 1: Brocade-Arista LAG
- Phase 2: Brocade VLAN 200 Routing
- Phase 3: Arista Jumbo Frames
- Phase 4: Identify 40GB NICs
- Phase 5: Reconfigure Proxmox
- Phase 6: Testing & Verification
Vendor Documentation
- Brocade ICX6610 Command Reference
- Arista 7050 Configuration Guide
- Proxmox Network Configuration
- Ceph Network Recommendations
Project History
Date | Milestone | Status |
---|---|---|
2025-10-14 | Project planning and documentation | ✅ Complete |
TBD | Phase 1: Brocade-Arista LAG | 📋 Ready |
TBD | Phase 2: Brocade routing | 📋 Ready |
TBD | Phase 3: Arista MTU | 📋 Ready |
TBD | Phase 4: Interface identification | 📋 Ready |
TBD | Phase 5: Proxmox reconfiguration | 📋 Ready |
TBD | Phase 6: Testing | 📋 Ready |
TBD | Project completion | ⏳ Pending |
License and Credits
Documentation created by: Claude (Anthropic AI Assistant)
Date: October 14, 2025
Version: 1.0
Project: Dapper Cluster 40GB Storage Network Upgrade
Contributors:
- Network architecture review and validation
- Configuration procedures and best practices
- Testing methodology and verification procedures
Ready to begin? Start with Phase 1: Configure Brocade-Arista LAG
Monitoring Stack Gap Analysis - October 17, 2025
Executive Summary
Comprehensive review of the Grafana, Prometheus, and Loki monitoring stack revealed that the core components are functional, with 97.6% of Prometheus scrape targets healthy. The critical issues identified require both Kubernetes configuration changes and external Ceph infrastructure remediation.
Component Status
✅ Grafana (Healthy)
- Status: Running (2/2 containers)
- Memory: 441Mi
- URL: grafana.chelonianlabs.com
- Datasources: Properly configured
- Prometheus: http://prometheus-operated.observability.svc.cluster.local:9090
- Loki: http://loki-headless.observability.svc.cluster.local:3100
- Alertmanager: http://alertmanager-operated.observability.svc.cluster.local:9093
- Dashboards: 35+ configured and loading
- Issues: None
✅ Prometheus (Healthy with Minor Issues)
- Status: Running HA mode (2 replicas)
- Memory: 2.1GB per pod
- Scrape Success: 161/165 targets healthy (97.6%)
- Storage: 5.8GB/100GB used (6%)
- Retention: 14 days
- Monitoring Coverage:
- 38 ServiceMonitors
- 7 PodMonitors
- 44 PrometheusRules
- Issues:
- 4 targets down (2.4% failure rate)
- Duplicate timestamp warnings from kube-state-metrics
⚠️ Loki (Functional but Dropping Logs)
- Status: Running (2/2 containers)
- Memory: 340Mi
- Storage: 1.6GB/30GB used (5%)
- Retention: 14 days
- Log Collection: Successfully collecting from 17 namespaces
- Issues:
- CRITICAL: Max entry size limit (256KB) exceeded
- Plex logs (553KB entries) being rejected
- Error: Max entry size '262144' bytes exceeded
✅ Promtail (Healthy)
- Status: DaemonSet running on all 11 nodes
- Memory: 70-140Mi per pod
- Target: http://loki-headless.observability.svc.cluster.local:3100/loki/api/v1/push
- Issues: None (successfully shipping logs despite Loki rejections)
⚠️ Alertmanager (Healthy but Active Alerts)
- Status: Running (2/2 containers)
- Memory: 37Mi
- Active Alerts: 19 alerts firing
- Issues: See Active Alerts section below
Critical Issues
1. Loki Log Entry Size Limit
Severity: High
Impact: Logs from high-volume applications being dropped
Details:
- Default max entry size: 262,144 bytes (256KB)
- Plex application producing 553KB log entries
- Logs silently dropped without alerting
Fix Applied:
- ✅ Updated /kubernetes/apps/observability/loki/app/helmrelease.yaml
- Added limits_config.max_line_size: 1048576 (1MB)
- Action Required: Commit and push to trigger Flux reconciliation
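The relevant values fragment looks approximately like this; the exact nesting depends on the Loki chart version in use, so treat it as a sketch rather than the committed diff:
# helmrelease.yaml (values excerpt)
values:
  loki:
    limits_config:
      max_line_size: 1048576          # 1MB, up from the 256KB default
      max_line_size_truncate: false   # set true to truncate oversized lines instead of rejecting them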
Verification:
# After deployment, verify no more errors:
kubectl logs -n observability -l app.kubernetes.io/name=promtail --tail=100 | grep "exceeded"
2. External Ceph Cluster Health Warnings
Severity: High
Impact: PVC provisioning failures, pod scheduling blocked
Details:
External Ceph cluster (running on Proxmox hosts) is showing HEALTH_WARN:
- PG_AVAILABILITY (Critical):
  - 128 placement groups inactive
  - 128 placement groups incomplete
  - This is blocking new PVC creation
- MDS_SLOW_METADATA_IO:
  - 1 MDS (metadata server) reporting slow I/O
  - Impacts CephFS performance
- MDS_TRIM:
  - 1 MDS behind on trimming
  - Can impact metadata operations
Ceph Cluster Info:
- FSID:
782dd297-215e-4c35-b7cf-659c20e6909e
- Version: 18.2.7-0 (Reef)
- Monitors: proxmox-02 (10.150.0.2), proxmox-03 (10.150.0.3), Proxmox-04 (10.150.0.4)
- Capacity: 195TB available / 244TB total (80% used)
Action Required: These are infrastructure-level issues that must be resolved on the Proxmox/Ceph cluster directly:
# SSH to Proxmox host and run:
ceph health detail
ceph pg dump | grep -E "inactive|incomplete"
ceph osd tree
ceph fs status cephfs
# Likely fixes (depending on root cause):
# - Check OSD status and bring up any down OSDs
# - Verify network connectivity between OSDs
# - Check disk space on OSD nodes
# - Review Ceph logs for specific PG issues
Kubernetes Impact:
- ❌ Gatus pod stuck in Pending (PVC provisioning failure)
- ❌ VolSync destination pods failing
- ❌ Any new workloads requiring CephFS storage blocked
Prometheus Scrape Target Failures
Down Targets (4 total):
- athena.manor:9221 - Unnamed exporter (likely SNMP)
- circe.manor:9221 - Unnamed exporter (likely SNMP)
- nut-upsd.kube-system.svc.cluster.local:3493 - NUT UPS exporter
- zigbee-controller-garage.manor - Zigbee controller
Analysis: All down targets are edge devices or external services. Core Kubernetes monitoring intact.
Recommended Actions:
- Verify network connectivity to .manor hostnames
- Check if SNMP exporters are running
- Investigate NUT UPS service in kube-system namespace
- Verify zigbee-controller service status
Active Alerts (19 Total)
High Priority:
- TargetDown - Related to 4 targets listed above
- KubePodNotReady - Related to Ceph PVC provisioning issues (gatus, volsync)
- KubeDeploymentRolloutStuck - Likely gatus deployment
- KubePersistentVolumeFillingUp - Check which PVs
Medium Priority:
- CPUThrottlingHigh - Investigate which pods/namespaces
- KubeJobFailed - 2 failed jobs identified:
  - kometa-29344680 (media namespace)
  - plex-off-deck-29344620 (media namespace)
- VolSyncVolumeOutOfSync - Expected with current Ceph issues
Informational:
- Watchdog - Always firing (heartbeat)
- PrometheusDuplicateTimestamps - kube-state-metrics timing issue (low impact)
Recommendations
Immediate Actions (Required before further work):
- ✅ Loki configuration updated - Ready for commit
- ⚠️ Fix Ceph PG issues - Must be done on Proxmox hosts
- ⚠️ Verify Ceph health - Run ceph health detail on Proxmox
Post-Ceph Fix:
- Delete stuck pods to retry provisioning:
  kubectl delete pod -n observability gatus-6fcfb64bc8-zz996
  kubectl delete pod -n observability volsync-dst-gatus-dst-8wvtx
- Investigate and fix down Prometheus targets:
  - Check SNMP exporter configurations
  - Verify NUT UPS service
  - Test network connectivity to .manor devices
- Review CPU throttling alerts:
  kubectl top pods -A --sort-by=cpu
  # Adjust resource limits as needed
- Clean up failed CronJobs in media namespace
Long-term Improvements:
- Add Loki ingestion metrics dashboard
- Configure log sampling/filtering for high-volume apps
- Set up PVC capacity monitoring alerts
- Review and tune Prometheus scrape intervals
- Consider adding CephFS-specific dashboards
Verification Checklist
After applying fixes:
- Loki accepting large log entries (check Promtail logs)
- No "exceeded" errors in Promtail logs
- Ceph cluster shows HEALTH_OK
- Gatus pod Running (2/2)
- All PVCs Bound
- Prometheus targets down count <= 2 (excluding optional edge devices)
- Active alerts reduced to baseline (~5-10 expected)
- All core namespace pods Running
Infrastructure Context
Deployment Method:
- GitOps: FluxCD
- Workflow: Edit repo → User commits → User pushes → Flux reconciles
Storage:
- Provider: External Ceph cluster (Proxmox)
- Storage Classes: cephfs-shared (default), cephfs-static
- Provisioner: rook-ceph.cephfs.csi.ceph.com
Monitoring Namespace:
- Namespace: observability
- Components: Grafana, Prometheus (HA), Loki, Promtail, Alertmanager
- Additional: VPA, Goldilocks, Gatus, Kromgo, various exporters
Next Steps
- User Action: Review and commit Loki configuration changes
- User Action: Fix Ceph PG availability issues on Proxmox
- After Ceph Fix: Proceed with pod cleanup and target investigations
- Monitor: Watch for new alerts or recurring issues
Generated: 2025-10-17 | Analysis Duration: ~30 minutes | Status: Awaiting user commit and Ceph infrastructure remediation
Ceph RBD Storage Migration Candidates
Analysis performed: 2025-10-17
Overview
This document identifies workloads in the cluster that would benefit from migrating to ceph-rbd (Ceph block storage) instead of cephfs-shared (CephFS shared filesystem).
Key Principle: Databases, time-series stores, and stateful services requiring high I/O performance should use block storage (RBD). Shared files, media libraries, and backups should use filesystem storage (CephFS).
Current Status
Already Using ceph-rbd ✓
- PostgreSQL (CloudNativePG) - 20Gi data + 5Gi WAL
Storage Classes Available
- ceph-rbd - Block storage (RWO) - Best for databases
- cephfs-shared - Shared filesystem (RWX) - Best for shared files/media
- cephfs-static - Static CephFS volumes
Storage Configuration Patterns
Before migrating workloads, it's important to understand how PVCs are created in this cluster:
Pattern 1: Volsync Component Pattern (Most Apps)
Used by: 41+ applications including all media apps, self-hosted apps, home automation, AI apps
How it works:
- The application's ks.yaml includes the volsync component:
  components:
    - ../../../../flux/components/volsync
- The PVC is created by the volsync component template (flux/components/volsync/pvc.yaml)
- Storage configuration is set via postBuild.substitute in the ks.yaml:
  postBuild:
    substitute:
      APP: prowlarr
      VOLSYNC_CAPACITY: 5Gi
      VOLSYNC_STORAGECLASS: cephfs-shared      # Default if not specified
      VOLSYNC_ACCESSMODES: ReadWriteMany       # Default if not specified
      VOLSYNC_SNAPSHOTCLASS: cephfs-snapshot   # Default if not specified
Default values:
- Storage Class: cephfs-shared
- Access Modes: ReadWriteMany
- Snapshot Class: cephfs-snapshot
Examples:
- Prowlarr: kubernetes/apps/media/prowlarr/ks.yaml
- Obsidian CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml
- Most workloads with < 100Gi storage needs
Pattern 2: Direct HelmRelease Pattern
Used by: Large observability workloads (Prometheus, Loki, AlertManager)
How it works:
- Storage is defined directly in the HelmRelease values
- No volsync component used
- PVC created by Helm chart templates
Example (Prometheus):
# kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
prometheus:
prometheusSpec:
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: cephfs-shared
resources:
requests:
storage: 100Gi
Examples:
- Prometheus: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
- Loki: kubernetes/apps/observability/loki/app/helmrelease.yaml
- AlertManager: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
Migration Candidates
🔴 HIGH Priority - Data Durability Risk
1. Dragonfly Redis
- Namespace: database
- Current Storage: NONE (ephemeral, in-memory only)
- Current Size: N/A (data lost on restart)
- Replicas: 3
- Recommended: Add ceph-rbd PVCs (~10Gi each for snapshots/persistence)
- Why: Redis alternative running in cluster mode needs persistent snapshots for:
- Data durability across restarts
- Cluster state recovery
- Snapshot-based backups
- Impact: HIGH - Currently losing all data on pod restart
- Config Location: kubernetes/apps/database/dragonfly-redis/cluster/cluster.yaml
- Migration Complexity: Medium - requires modifying Dragonfly CRD to add volumeClaimTemplates
2. EMQX MQTT Broker
- Namespace: database
- Current Storage: NONE (emptyDir, ephemeral)
- Current Size: N/A (data lost on restart)
- Replicas: 3 (StatefulSet)
- Recommended: Add ceph-rbd PVCs (~5-10Gi each for session/message persistence)
- Why: MQTT brokers need persistent storage for:
- Retained messages
- Client subscriptions
- Session state for QoS > 0
- Cluster configuration
- Impact: HIGH - Currently losing retained messages and sessions on restart
- Config Location: kubernetes/apps/database/emqx/cluster/cluster.yaml
- Migration Complexity: Medium - requires modifying EMQX CRD to add persistent volumes
🟡 MEDIUM Priority - Performance & Best Practices
3. CouchDB (obsidian-couchdb)
- Namespace: selfhosted
- Current Storage: cephfs-shared
- Current Size: 5Gi
- Replicas: 1 (Deployment)
- Storage Pattern: ✅ Volsync Component (kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml)
- Recommended: Migrate to ceph-rbd
- Why: NoSQL database benefits from:
- Better I/O performance for document reads/writes
- Improved fsync performance for data integrity
- Block-level snapshots for consistent backups
- Impact: Medium - requires backup, PVC migration, restore
- Migration Complexity: Medium - GitOps workflow with volsync pattern
- Update ks.yaml postBuild substitutions
- Commit and push changes
- Flux recreates PVC with new storage class
- Volsync handles data restoration
4. Prometheus
- Namespace: observability
- Current Storage: cephfs-shared
- Current Size: 2x100Gi (200Gi total across 2 replicas)
- Replicas: 2 (StatefulSet)
- Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
- Recommended: Migrate to ceph-rbd
- Why: Time-series database with:
- Heavy write workload (constant metric ingestion)
- Random read patterns for queries
- Significant performance gains with block storage
- Better compaction performance
- Impact: HIGH - Largest performance improvement opportunity
- Migration Complexity: High
- Large data volume (200Gi total)
- Update HelmRelease volumeClaimTemplate.spec.storageClassName
- Commit and push changes
- Flux recreates StatefulSet with new storage
- Consider data retention during migration
5. Loki
- Namespace: observability
- Current Storage: cephfs-shared
- Current Size: 30Gi
- Replicas: 1 (StatefulSet)
- Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/loki/app/helmrelease.yaml)
- Recommended: Migrate to ceph-rbd
- Why: Log aggregation database benefits from:
- Better write performance for high-volume log ingestion
- Improved compaction and chunk management
- Block storage better suited for LSM-tree based storage
- Impact: Medium - noticeable improvement in log write performance
- Migration Complexity: Medium
- Moderate data size
- Update HelmRelease singleBinary.persistence.storageClass
- Commit and push changes
- Flux recreates StatefulSet with new storage
- Can tolerate some log loss during migration
6. AlertManager
- Namespace: observability
- Current Storage: cephfs-shared
- Current Size: 2Gi
- Replicas: 1 (StatefulSet)
- Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
- Recommended: Migrate to ceph-rbd
- Why: Alert state persistence benefits from:
- Consistent snapshot capabilities
- Better fsync performance for state writes
- Impact: Low - small storage footprint, quick migration
- Migration Complexity: Low
- Small data size
- Update HelmRelease alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName
- Commit and push changes
- Flux recreates StatefulSet with new storage
- Minimal downtime
What Should Stay on CephFS
The following workloads are correctly using CephFS and should NOT be migrated:
Media & Shared Files (RWX Access Required)
- Media libraries (Plex, Sonarr, Radarr, etc.) - Need shared filesystem access
- AI models (Ollama 100Gi) - Large files with potential shared access
- Application configs - Often need shared access across pods
Backup Storage
- Volsync repositories (cephfs-static) - Restic repositories work well on filesystem
- MinIO data (cephfs-static, 10Ti) - Object storage on filesystem is appropriate
Other
- OpenEBS etcd/minio - Already using local PVs (mayastor-etcd-localpv, openebs-minio-localpv)
- Runner work volumes - Ephemeral workload storage
Migration Summary
Total Storage to Migrate
- Dragonfly: +30Gi (3 replicas x 10Gi) - NEW storage
- EMQX: +15-30Gi (3 replicas x 5-10Gi) - NEW storage
- CouchDB: 5Gi (migrate from cephfs)
- Prometheus: 200Gi (migrate from cephfs)
- Loki: 30Gi (migrate from cephfs)
- AlertManager: 2Gi (migrate from cephfs)
Total New ceph-rbd Needed: ~280-295Gi
Total Migrating from CephFS: ~237Gi
Recommended Migration Order
- Phase 0: Validation (Test the process)
  - ✅ AlertManager - LOW RISK test case to validate GitOps workflow
- Phase 1: Data Durability (Immediate)
  - Dragonfly - Add persistent storage
  - EMQX - Add persistent storage
- Phase 2: Small Databases (Quick Wins)
  - CouchDB - Medium complexity, important for Obsidian data
- Phase 3: Large Time-Series DBs (Performance)
  - Loki - Medium size, good performance gains
  - Prometheus - Large size, significant performance gains
Migration Checklists
Phase 0: AlertManager Migration (Validation Test)
Goal: Validate the GitOps migration workflow with a low-risk workload
Pre-Migration Checklist:
- Verify current AlertManager state
  kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
  kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
  kubectl describe pvc -n observability alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0 | grep "StorageClass:"
- Check current storage usage
  kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 -- df -h /alertmanager
- Document current alerts (optional - state will rebuild)
  kubectl get prometheusrule -A
- Verify ceph-rbd storage class exists
  kubectl get storageclass ceph-rbd
  kubectl get volumesnapshotclass ceph-rbd-snapshot
Migration Steps:
- Create feature branch
  git checkout -b feat/alertmanager-rbd-migration
- Update HelmRelease configuration
  - File: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
  - Change: alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName: ceph-rbd
  - Line: ~104 (search for alertmanager storageClassName)
- Commit changes
  git add kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
  git commit -m "feat(alertmanager): migrate to ceph-rbd storage"
- Push to remote
  git push origin feat/alertmanager-rbd-migration
- Monitor Flux reconciliation
  flux reconcile kustomization kube-prometheus-stack -n observability --with-source
  watch kubectl get pods -n observability -l app.kubernetes.io/name=alertmanager
- Verify new PVC created with ceph-rbd
  kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
  kubectl describe pvc -n observability <new-pvc-name> | grep "StorageClass:"
- Verify AlertManager is running
  kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
  kubectl logs -n observability -l app.kubernetes.io/name=alertmanager --tail=50
- Check AlertManager UI (https://alertmanager.${SECRET_DOMAIN})
  - UI loads successfully
  - Alerts are being received
  - Silences can be created
  - Wait 24 hours to verify stability
- Merge to main
  git checkout main
  git merge feat/alertmanager-rbd-migration
  git push origin main
Post-Migration Validation:
- Verify old PVC is deleted (should happen automatically)
  kubectl get pvc -A | grep alertmanager
- Check Ceph RBD usage
  kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph df
- Document lessons learned for larger migrations
- Update this checklist with any issues encountered
Rollback Plan (if needed):
- Revert the commit
  git revert HEAD
  git push origin main
- Flux will recreate AlertManager with cephfs-shared
- Alert state will rebuild (acceptable data loss)
Migration Procedures
Pattern 1: Volsync Component Apps (GitOps Workflow)
Used for: CouchDB, and any app using the volsync component
Steps:
- Update ks.yaml - Add storage class overrides to postBuild.substitute:
  postBuild:
    substitute:
      APP: obsidian-couchdb
      VOLSYNC_CAPACITY: 5Gi
      VOLSYNC_STORAGECLASS: ceph-rbd            # Changed from default
      VOLSYNC_ACCESSMODES: ReadWriteOnce        # Changed from ReadWriteMany
      VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot  # Changed from cephfs-snapshot
      VOLSYNC_CACHE_STORAGECLASS: ceph-rbd      # For volsync cache
      VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce  # For volsync cache
- Commit and push changes to Git repository
- Flux reconciles automatically:
  - Flux detects the change in Git
  - Recreates the PVC with new storage class
  - Volsync ReplicationDestination restores data from backup
  - Application pod starts with new RBD-backed storage
- Verify the application is running correctly with new storage:
  kubectl get pvc -n <namespace> <app>
  kubectl describe pvc -n <namespace> <app> | grep StorageClass
Example files:
- CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml
Pattern 2: Direct HelmRelease Apps (GitOps Workflow)
Used for: Prometheus, Loki, AlertManager
Steps:
For Prometheus & AlertManager:
- Update helmrelease.yaml - Change storageClassName in volumeClaimTemplate:
  # kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
  prometheus:
    prometheusSpec:
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: ceph-rbd   # Changed from cephfs-shared
            resources:
              requests:
                storage: 100Gi
  alertmanager:
    alertmanagerSpec:
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: ceph-rbd   # Changed from cephfs-shared
            resources:
              requests:
                storage: 2Gi
- Commit and push changes to Git repository
- Flux reconciles automatically:
  - Flux detects the HelmRelease change
  - Helm recreates the StatefulSet
  - New PVCs created with ceph-rbd storage class
  - Pods start with new storage (data loss acceptable for metrics/alerts)
For Loki:
- Update helmrelease.yaml - Change storageClass in persistence config:
  # kubernetes/apps/observability/loki/app/helmrelease.yaml
  singleBinary:
    persistence:
      enabled: true
      storageClass: ceph-rbd   # Changed from cephfs-shared
      size: 30Gi
- Commit and push changes to Git repository
- Flux reconciles automatically - Same process as Prometheus
Note: For observability workloads, some data loss during migration is typically acceptable since:
- Prometheus has 14d retention - new data will accumulate
- Loki has 14d retention - new logs will accumulate
- AlertManager state is ephemeral and will rebuild
For Services Without Storage (Dragonfly, EMQX)
Steps:
- Update the CRD to add volumeClaimTemplates backed by ceph-rbd (see the sketch after this list)
- Commit and push changes
- Flux recreates StatefulSet with persistent storage
- Configure volsync backup strategy (optional)
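As a shape reference for the first step, the sketch below shows the generic StatefulSet outcome the Dragonfly and EMQX operators should produce: each replica gets its own RBD-backed PVC via volumeClaimTemplates. This is illustrative only; the name, image, and size are placeholders, and each operator's CRD exposes its own persistence fields, so consult the operator documentation for the exact spelling.
# Illustrative end state, not an actual cluster manifest: a StatefulSet whose
# pods each receive a dedicated ceph-rbd PVC via volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-broker              # hypothetical name
  namespace: database
spec:
  serviceName: example-broker
  replicas: 3
  selector:
    matchLabels:
      app: example-broker
  template:
    metadata:
      labels:
        app: example-broker
    spec:
      containers:
        - name: broker
          image: example/broker:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data           # snapshot / session directory
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]   # RBD is RWO
        storageClassName: ceph-rbd
        resources:
          requests:
            storage: 10Gi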
Important Migration Considerations
Snapshot Class Compatibility
When migrating from CephFS to Ceph RBD, snapshot classes must match the storage backend:
| Storage Class | Compatible Snapshot Class |
|---|---|
| cephfs-shared | cephfs-snapshot |
| ceph-rbd | ceph-rbd-snapshot |
Why this matters:
- Volsync uses snapshots for backup/restore operations
- Using the wrong snapshot class will cause volsync to fail
- Both the main storage and cache storage need matching snapshot classes
Available VolumeSnapshotClasses in cluster:
$ kubectl get volumesnapshotclass
NAME DRIVER DELETIONPOLICY
ceph-rbd-snapshot rook-ceph.rbd.csi.ceph.com Delete
cephfs-snapshot rook-ceph.cephfs.csi.ceph.com Delete
csi-nfs-snapclass nfs.csi.k8s.io Delete
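For context, a VolumeSnapshotClass of the kind listed above is a small object like the sketch below (assembled from the listing; the live classes may carry additional Rook-specific parameters such as the cluster ID and snapshotter secret references):
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ceph-rbd-snapshot
driver: rook-ceph.rbd.csi.ceph.com   # must match the CSI driver of the PVC being snapshotted
deletionPolicy: Delete
# parameters (clusterID, snapshotter secrets, etc.) omitted - managed by Rook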
Access Mode Changes
| Storage Type | Access Mode | Use Case |
|---|---|---|
| CephFS (cephfs-shared) | ReadWriteMany (RWX) | Shared filesystems, media libraries |
| Ceph RBD (ceph-rbd) | ReadWriteOnce (RWO) | Databases, block storage |
Impact:
- RBD volumes can only be mounted by one node at a time
- Applications must be single-replica or use StatefulSet with pod affinity
- Most database workloads already use RWO - minimal impact
Volsync Cache Storage
When using volsync with RBD, both the main storage and cache storage should use RBD:
postBuild:
substitute:
# Main PVC settings
VOLSYNC_STORAGECLASS: ceph-rbd
VOLSYNC_ACCESSMODES: ReadWriteOnce
VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot
# Cache PVC settings (must also match RBD)
VOLSYNC_CACHE_STORAGECLASS: ceph-rbd
VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce
VOLSYNC_CACHE_CAPACITY: 10Gi
Why? Mixing CephFS cache with RBD main storage can cause:
- Snapshot compatibility issues
- Performance inconsistencies
- Backup/restore failures
Technical Notes
- Ceph RBD Pool: Backed by rook-pvc-pool
- Storage Class: ceph-rbd
- Access Mode: RWO (ReadWriteOnce) - single node access
- Features: Volume expansion enabled, snapshot support
- Reclaim Policy: Delete
- CSI Driver: rook-ceph.rbd.csi.ceph.com
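Taken together, these notes correspond to a StorageClass roughly like the sketch below. The live object is managed by the Rook operator and carries cluster-specific parameters (cluster ID, CSI secret references) omitted here; volumeBindingMode is an assumption.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  pool: rook-pvc-pool          # RBD pool backing the volumes
  # clusterID and csi.storage.k8s.io/* secret parameters omitted (Rook-managed)
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate   # assumption; check the live StorageClass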
References
- Current cluster storage: kubernetes/apps/storage/
- Database configs: kubernetes/apps/database/*/cluster/cluster.yaml
- Storage class definition: Managed by Rook operator
VPA-Based Resource Limit Updates
Summary
This document outlines a plan to systematically update resource limits across the cluster based on VPA (Vertical Pod Autoscaler) recommendations from Goldilocks to eliminate CPU throttling alerts.
Changes Already Made
1. Alert Configuration
File: kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml
- Changed default receiver from pushover to "null"
- Added explicit routes for severity: warning and severity: critical to pushover
- Result: Only critical and warning alerts will trigger pushover notifications (no more info-level spam)
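The routing intent, expressed in plain Alertmanager config syntax for illustration (the repo's alertmanagerconfig.yaml may wrap this in chart values or an AlertmanagerConfig resource, so treat the field layout as an assumption):
route:
  receiver: "null"              # default receiver: info-level alerts go nowhere
  routes:
    - receiver: pushover
      matchers:
        - severity="critical"
    - receiver: pushover
      matchers:
        - severity="warning"
receivers:
  - name: "null"
  - name: pushover
    # pushover_configs omitted here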
2. Promtail Resources
File: kubernetes/apps/observability/promtail/app/helmrelease.yaml
- CPU Request: 50m → 100m
- CPU Limit: 100m → 250m
- Rationale: VPA recommends 101m upper bound, but we added headroom for log bursts
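The corresponding values change looks roughly like this (a sketch assuming the promtail chart's top-level resources value; confirm against the file above):
# kubernetes/apps/observability/promtail/app/helmrelease.yaml (sketch)
spec:
  values:
    resources:
      requests:
        cpu: 100m    # raised from 50m
      limits:
        cpu: 250m    # raised from 100m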
Priority Workloads for Update
High Priority (Currently Throttling or at Risk)
Observability Namespace
-
Loki - Log aggregation
- Current: cpu: 35m request, 200m limit
- VPA: cpu: 23m request, 140m limit
- Action: Keep current limits (already adequate)
-
Grafana - Visualization
- Current: No CPU limits
- VPA: cpu: 63m request, 213m limit
- Action: Add limits - 100m request, 500m limit for burst capacity
-
Internal Nginx Ingress (network namespace)
- Current: cpu: 500m request, no limit
- VPA: cpu: 63m request, 316m limit
- Action: Add 500m limit (keep generous for traffic spikes)
Medium Priority (Good to standardize)
Observability Namespace
-
kube-state-metrics
- VPA: cpu: 23m request, 77m limit
- Action: Add resources block
-
Goldilocks Controller
- VPA: cpu: 587m request, 2268m limit (!)
- Action: Add generous limits for this workload
-
Blackbox Exporter
- VPA: cpu: 15m request, 37m limit
- Action: Add resources block
Network Namespace
-
External Nginx Ingress
- VPA: cpu: 49m request, 165m limit
- Action: Add resources block
-
Cloudflared
- VPA: cpu: 15m request, 214m limit
- Action: Add resources block (note the high burst)
Low Priority (Already well-configured)
- Node Exporter: Current limits are generous (250m limit vs 22m VPA)
- DCGM Exporter: Has limits, VPA shows adequate
- Media workloads: Most have no CPU limits (intentional for high CPU apps like Plex, Bazarr)
Implementation Strategy
Phase 1: Stop the Alerts (DONE ✅)
- Update alertmanagerconfig to filter by severity
- Update promtail CPU limits
Phase 2: Observability Namespace (Next)
Update these critical monitoring components:
- Grafana - Add CPU limits
- kube-state-metrics - Add resources
- Goldilocks controller - Add resources
- Blackbox exporter - Add resources
Phase 3: Network Infrastructure
- Internal nginx ingress - Add CPU limit
- External nginx ingress - Add resources
- Cloudflared - Add resources
Phase 4: Optional Refinements
- Review VPA recommendations quarterly
- Adjust limits based on actual usage patterns
- Consider enabling VPA auto-mode for non-critical workloads
How to Use VPA Recommendations
1. View All Recommendations
# Run the helper script
./scripts/vpa-resource-recommendations.sh
# Or visit the dashboard
open https://goldilocks.chelonianlabs.com
2. Get Specific Workload Recommendations
kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{.status.recommendation.containerRecommendations[0]}' | jq
3. Update HelmRelease
Add resources block under values:
values:
  resources:
    requests:
      cpu: <vpa_target>
      memory: <vpa_target_memory>
    limits:
      cpu: <vpa_upper_or_2x_for_bursts>
      memory: <vpa_upper_memory>
4. Apply and Monitor
# Commit changes
git add kubernetes/apps/observability/grafana/app/helmrelease.yaml
git commit -m "feat(grafana): add CPU limits based on VPA recommendations"
git push
# Force reconciliation (optional)
flux reconcile helmrelease -n observability grafana
# Monitor for throttling
kubectl top pods -n observability --containers
VPA Interpretation Guide
VPA Recommendation Fields:
- target: Use as your request value
- lowerBound: Minimum to function
- upperBound: Use as limit (or higher for burst workloads)
- uncappedTarget: What VPA thinks is ideal without constraints
When to Deviate:
- Burst workloads (logs, ingress): Use 2-3x upper bound for limits
- Background jobs: Match VPA recommendations closely
- User-facing apps: Add 50-100% headroom for traffic spikes
- Resource-constrained: Start with target, monitor, then adjust
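As a worked example, applying these rules to the Grafana numbers from the priority list above (target 63m, upper bound 213m, user-facing) lands on the 100m request / 500m limit already proposed there. A minimal sketch, assuming the Grafana chart accepts a top-level resources value:
values:
  resources:
    requests:
      cpu: 100m    # VPA target is 63m; rounded up for a user-facing app
    limits:
      cpu: 500m    # roughly 2x the 213m upper bound to absorb dashboard traffic spikes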
Monitoring for Success
After updates, verify alerts have stopped:
# Check for CPU throttling alerts
kubectl get alerts -A | grep -i throttl
# Check actual CPU usage vs limits
kubectl top pods -A --containers | sort -k4 -h -r | head -20
# Review VPA over time
watch kubectl get vpa -n observability
Tools Created
- scripts/vpa-resource-recommendations.sh - Extract VPA recommendations with HelmRelease file locations
- This document - Implementation plan and guidance