Dapper Cluster Documentation

This documentation covers the architecture, configuration, and operations of the Dapper Kubernetes cluster, a high-performance home lab infrastructure with GPU capabilities.

Cluster Overview

graph TD
    subgraph Control Plane
        CP1[Control Plane 1<br>4 CPU, 16GB]
        CP2[Control Plane 2<br>4 CPU, 16GB]
        CP3[Control Plane 3<br>4 CPU, 16GB]
    end

    subgraph Worker Nodes
        W1[Worker 1<br>16 CPU, 128GB]
        W2[Worker 2<br>16 CPU, 128GB]
        GPU[GPU Node<br>16 CPU, 128GB<br>4x Tesla P100]
    end

    CP1 --- CP2
    CP2 --- CP3
    CP3 --- CP1

    Control Plane --> Worker Nodes

Hardware Specifications

Control Plane

  • 3 nodes for high availability
  • 4 CPU cores per node
  • 16GB RAM per node
  • Dedicated to cluster control plane operations

Worker Nodes

  • 2 general-purpose worker nodes
  • 16 CPU cores per node
  • 128GB RAM per node
  • Handles general workloads and applications

GPU Node

  • Specialized GPU worker node
  • 16 CPU cores
  • 128GB RAM
  • 4x NVIDIA Tesla P100 GPUs
  • Handles ML/AI and GPU-accelerated workloads

Key Features

  • High-availability Kubernetes cluster
  • GPU acceleration support
  • Automated deployment using Flux CD
  • Secure secrets management with SOPS
  • NFS and OpenEBS storage integration
  • Comprehensive monitoring and observability
  • Media services automation

Infrastructure Components

graph TD
    subgraph Core Services
        Flux[Flux CD]
        Storage[Storage Layer]
        Network[Network Layer]
    end

    subgraph Applications
        Media[Media Stack]
        Monitor[Monitoring]
        GPU[GPU Workloads]
    end

    Core Services --> Applications

    Storage --> |NFS/OpenEBS| Applications
    Network --> |Ingress/DNS| Applications

Documentation Structure

  • Architecture: Detailed technical documentation about cluster design and components

    • High-availability control plane design
    • Storage architecture and configuration
    • Network topology and policies
    • GPU integration and management
  • Applications: Information about deployed applications and their configurations

    • Media services stack
    • Monitoring and observability
    • GPU-accelerated applications
  • Operations: Guides for installation, maintenance, and troubleshooting

    • Cluster setup procedures
    • Node management
    • GPU configuration
    • Maintenance tasks

Getting Started

For new users, we recommend starting with:

  1. Architecture Overview - Understanding the cluster design
  2. Installation Guide - Setting up the cluster
  3. Application Stack - Deploying applications

Architecture Overview

Cluster Architecture

graph TD
    subgraph Control Plane
        CP1[Control Plane 1<br>4 CPU, 16GB]
        CP2[Control Plane 2<br>4 CPU, 16GB]
        CP3[Control Plane 3<br>4 CPU, 16GB]

        CP1 --- CP2
        CP2 --- CP3
        CP3 --- CP1
    end

    subgraph Worker Nodes
        W1[Worker 1<br>16 CPU, 128GB]
        W2[Worker 2<br>16 CPU, 128GB]
    end

    subgraph GPU Node
        GPU[GPU Worker<br>16 CPU, 128GB<br>4x Tesla P100]
    end

    Control Plane --> Worker Nodes
    Control Plane --> GPU

Core Components

Control Plane

  • High Availability: 3-node control plane configuration
  • Resource Allocation: 4 CPU, 16GB RAM per node
  • Components:
    • etcd cluster
    • API Server
    • Controller Manager
    • Scheduler

Worker Nodes

  • General Purpose Workers: 2 nodes
  • Resources per Node:
    • 16 CPU cores
    • 128GB RAM
  • Workload Types:
    • Application deployments
    • Database workloads
    • Media services
    • Monitoring systems

GPU Node

  • Specialized Worker: 1 node
  • Hardware:
    • 16 CPU cores
    • 128GB RAM
    • 4x NVIDIA Tesla P100 GPUs
  • Workload Types:
    • ML/AI workloads
    • Video transcoding
    • GPU-accelerated applications
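
As a quick illustration (not taken from the cluster manifests), a workload claims one of the Tesla P100s through the extended resource advertised by the NVIDIA device plugin / GPU operator, typically nvidia.com/gpu. The pod name and image tag below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder CUDA image
      command: ["nvidia-smi"]                     # prints the allocated P100
      resources:
        limits:
          nvidia.com/gpu: 1                       # schedules the pod onto the GPU node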

Network Architecture

graph TD
    subgraph External
        Internet((Internet))
        DNS((DNS))
    end

    subgraph Network Edge
        FW[Firewall]
        LB[Load Balancer]
    end

    subgraph Kubernetes Network
        CP[Control Plane]
        Workers[Worker Nodes]
        GPUNode[GPU Node]

        subgraph Services
            Ingress[Ingress Controller]
            CoreDNS[CoreDNS]
            CNI[Network Plugin]
        end
    end

    Internet --> FW
    DNS --> FW
    FW --> LB
    LB --> CP
    CP --> Workers
    CP --> GPUNode
    Services --> Workers
    Services --> GPUNode

Storage Architecture

graph TD
    subgraph Storage Classes
        NFS[NFS Storage Class]
        OpenEBS[OpenEBS Storage Class]
    end

    subgraph Persistent Volumes
        NFS --> NFS_PV[NFS PVs]
        OpenEBS --> Local_PV[Local PVs]
    end

    subgraph Workload Types
        NFS_PV --> Media[Media Storage]
        NFS_PV --> Shared[Shared Config]
        Local_PV --> DB[Databases]
        Local_PV --> Cache[Cache Storage]
    end

Security Considerations

  • Network segmentation using Kubernetes network policies
  • Encrypted secrets management with SOPS
  • TLS encryption for all external services
  • Regular security updates via automated pipelines
  • GPU access controls and resource quotas

Scalability

The cluster architecture is designed to be scalable:

  • High-availability control plane (3 nodes)
  • Expandable worker node pool
  • Specialized GPU node for compute-intensive tasks
  • Dynamic storage provisioning
  • Load balancing for external services
  • Resource quotas and limits management

Monitoring and Observability

graph LR
    subgraph Monitoring Stack
        Prom[Prometheus]
        Graf[Grafana]
        Alert[Alertmanager]
    end

    subgraph Node Types
        CP[Control Plane Metrics]
        Work[Worker Metrics]
        GPU[GPU Metrics]
    end

    CP --> Prom
    Work --> Prom
    GPU --> Prom
    Prom --> Graf
    Prom --> Alert

Resource Management

Control Plane

  • Reserved for Kubernetes control plane components
  • Optimized for control plane operations
  • High availability configuration

Worker Nodes

  • General purpose workloads
  • Balanced resource allocation
  • Flexible scheduling options

GPU Node

  • Dedicated for GPU workloads
  • NVIDIA GPU operator integration
  • Specialized resource scheduling

Network Architecture

This document covers the Kubernetes application-level networking. For physical network topology and VLAN configuration, see Network Topology.

Container Networking (CNI)

Cilium CNI

The cluster uses Cilium as the primary Container Network Interface (CNI):

  • Pod CIDR: 10.69.0.0/16 (native routing mode)
  • Service CIDR: 10.96.0.0/16
  • Mode: Non-exclusive (paired with Multus for multi-network support)
  • Kube-Proxy Replacement: Enabled (eBPF-based service load balancing)
  • Load Balancing Algorithm: Maglev with DSR (Direct Server Return)
  • Network Policy: Endpoint routes enabled
  • BPF Masquerading: Enabled for outbound traffic

Key Features:

  • High-performance eBPF data plane
  • Native Kubernetes network policy support
  • L2 announcements for external load balancer IPs
  • Advanced observability and monitoring
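
As a hedged sketch of how these settings map to Cilium Helm values (the authoritative values live in the Cilium HelmRelease; key names follow recent Cilium charts and may vary slightly by version):

# Illustrative Cilium values only; not the cluster's actual HelmRelease
routingMode: native
ipv4NativeRoutingCIDR: 10.69.0.0/16   # pod CIDR, routed natively
kubeProxyReplacement: true            # eBPF service load balancing
bpf:
  masquerade: true
loadBalancer:
  algorithm: maglev
  mode: dsr
endpointRoutes:
  enabled: true
l2announcements:
  enabled: true
cni:
  exclusive: false                    # leave room for Multus as a secondary CNI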

Multus CNI (Multiple Networks)

Multus provides additional network interfaces to pods beyond the primary Cilium network:

  • Primary Use: IoT network attachment (VLAN-based isolation)
  • Network Attachment: macvlan on ens19 interface
  • Mode: Bridge mode with DHCP IPAM
  • Purpose: Enable pods to connect to additional networks (e.g., IoT devices, legacy systems)

Pods can request additional networks via annotations:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf
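
For reference, the attachment named in that annotation is defined by a NetworkAttachmentDefinition. A sketch matching the description above (macvlan on ens19, bridge mode, DHCP IPAM); the exact CNI JSON in the cluster may differ:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens19",
      "mode": "bridge",
      "ipam": { "type": "dhcp" }
    }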

Ingress Controllers

The cluster uses dual ingress-nginx controllers for traffic routing:

Internal Ingress

  • Class: internal (default)
  • Purpose: Internal services, private DNS
  • Version: v4.13.3
  • Load Balancer: Cilium L2 announcement
  • DNS: Synced to internal DNS via k8s-gateway and External-DNS (UniFi webhook)

External Ingress

  • Class: external
  • Purpose: Public-facing services
  • Version: v4.13.3
  • Load Balancer: Cilium L2 announcement
  • DNS: Synced to Cloudflare via External-DNS
  • Tunnel: Cloudflared for secure access

Load Balancer IP Management

Cilium L2 Announcements

Cilium's L2 announcement feature provides load balancer IPs for services:

  • How it works: Cilium announces load balancer IPs via L2 (ARP/NDP)
  • Policy-based: L2AnnouncementPolicy defines which services get announced
  • Benefits:
    • No external load balancer required
    • Native Kubernetes LoadBalancer service type support
    • High availability through leader election
    • Automatic failover

Configuration: See kubernetes/apps/kube-system/cilium/config/l2.yaml

This enables both ingress controllers to receive external IPs that are accessible from the broader network.
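
For orientation, an L2AnnouncementPolicy has roughly the following shape (the real policy is in the l2.yaml referenced above; the selector and interface pattern here are placeholders):

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-announcements
spec:
  serviceSelector:
    matchLabels: {}        # placeholder: announce all LoadBalancer services
  interfaces:
    - ^eth[0-9]+$          # placeholder interface regex
  externalIPs: true
  loadBalancerIPs: true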

Network Policies

graph LR
    subgraph Policies
        Default[Default Deny]
        Allow[Allowed Routes]
    end

    subgraph Apps
        Media[Media Stack]
        Monitor[Monitoring]
        DB[Databases]
    end

    Allow --> Media
    Allow --> Monitor
    Default --> DB

DNS Configuration

Internal DNS (k8s-gateway)

  • Purpose: DNS server for internal ingresses
  • Domain: Internal cluster services
  • Integration: Works with External-DNS for automatic record creation

External-DNS (Dual Instances)

Instance 1: Internal DNS

  • Provider: UniFi (via webhook provider)
  • Target: UDM Pro Max
  • Ingress Class: internal
  • Purpose: Sync private DNS records for internal services

Instance 2: External DNS

  • Provider: Cloudflare
  • Ingress Class: external
  • Purpose: Sync public DNS records for externally accessible services

How DNS Works

  1. Create an Ingress with class internal or external
  2. External-DNS watches for new/updated ingresses
  3. Appropriate External-DNS instance syncs DNS records to target provider
  4. Services become accessible via their configured hostnames
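
Concretely, exposing a service is just a matter of creating an Ingress with the right class; a minimal, illustrative example for a public service (hostname, names, and annotation value are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-namespace
  annotations:
    # optional CNAME target, e.g. a tunnel hostname (placeholder value)
    external-dns.alpha.kubernetes.io/target: external.example.com
spec:
  ingressClassName: external   # use "internal" for private DNS via the UniFi webhook
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80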

Security

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
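
The default-deny baseline is then opened up with narrow allow rules. A hedged example that would let Prometheus in the monitoring namespace scrape an application's metrics port (namespace, labels, and port are placeholders):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape
  namespace: my-namespace              # placeholder
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: my-app   # placeholder label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080                   # placeholder metrics port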

TLS Configuration

  • Automatic certificate management via cert-manager
  • Let's Encrypt integration
  • Internal PKI for service mesh
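
A representative cert-manager ClusterIssuer for the Let's Encrypt integration, shown with a Cloudflare DNS-01 solver since external DNS already lives there (issuer name, email, and secret name are placeholders, not the cluster's actual config):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com               # placeholder
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # placeholder secret
              key: api-token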

Service Mesh

Traffic Flow

graph LR
    subgraph Ingress
        External[External Traffic]
        Nginx[Ingress-NGINX]
    end

    subgraph Services
        App1[Service 1]
        App2[Service 2]
        DB[Database]
    end

    External --> Nginx
    Nginx --> App1
    Nginx --> App2
    App1 --> DB
    App2 --> DB

Best Practices

  1. Security

    • Implement default deny policies
    • Use TLS everywhere
    • Regular security audits
    • Network segmentation
  2. Performance

    • Load balancer optimization
    • Connection pooling
    • Proper resource allocation
    • Traffic monitoring
  3. Reliability

    • High availability configuration
    • Failover planning
    • Backup routes
    • Health checks
  4. Monitoring

    • Network metrics collection
    • Traffic analysis
    • Latency monitoring
    • Bandwidth usage tracking

Troubleshooting

Common network issues and resolution steps:

  1. Connectivity Issues

    • Check network policies
    • Verify DNS resolution
    • Inspect service endpoints
    • Review ingress configuration
  2. Performance Problems

    • Monitor network metrics
    • Check for bottlenecks
    • Analyze traffic patterns
    • Review resource allocation

Network Topology

Overview

The Dapper Cluster network spans two physical locations (garage/shop and house) connected via a 1Gbps wireless bridge. The network uses a dual-switch design in the garage with high-speed interconnects for server and storage traffic.

Network Locations

Garage/Shop (Server Room)

  • Primary compute infrastructure
  • Core and distribution switches
  • 4x Proxmox hosts running Talos/Kubernetes
  • High-speed storage network

House

  • Access layer switch
  • OPNsense router/firewall
  • Client devices
  • Connected to garage via 60GHz wireless bridge (1Gbps)

Device Inventory

Core Network Equipment

| Device | Model | Management IP | Location | Role | Notes |
|---|---|---|---|---|---|
| Brocade Core | ICX6610 | 192.168.1.20 | Garage | Core/L3 Switch | Manages VLAN 150, 200 routing |
| Arista Distribution | 7050 | 192.168.1.21 | Garage | Distribution Switch | High-speed 40Gb interconnects |
| Aruba Access | S2500-48p | 192.168.1.26 | House | Access Switch | PoE, client devices |
| OPNsense Router | i3-4130T - 16GB | 192.168.1.1 | House | Router/Firewall | Manages VLAN 1, 100 routing |
| Mikrotik Radio (House) | NRay60 | 192.168.1.7 | House | Wireless Bridge | 1Gbps to garage |
| Mikrotik Radio (Shop) | NRay60 | 192.168.1.8 | Garage | Wireless Bridge | 1Gbps to house |
| Mikrotik Switch | CSS326-24G-2S | 192.168.1.27 | Garage | Wireless Bridge - Brocade Core | Always-up interconnect |

Compute Infrastructure

| Device | Management IP | IPMI IP | Location | Links | Notes |
|---|---|---|---|---|---|
| Proxmox Host 1 | 192.168.1.62 | 192.168.1.162 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 2 | 192.168.1.63 | 192.168.1.165 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 3 | 192.168.1.64 | 192.168.1.163 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 4 | 192.168.1.66 | 192.168.1.164 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |

Kubernetes Nodes (VMs on Proxmox)

| Hostname | Primary IP (VLAN 100) | Storage IP (VLAN 150) | Role | Host |
|---|---|---|---|---|
| talos-control-1 | 10.100.0.50 | 10.150.0.10 | Control Plane | Proxmox-03 |
| talos-control-2 | 10.100.0.51 | 10.150.0.11 | Control Plane | Proxmox-04 |
| talos-control-3 | 10.100.0.52 | 10.150.0.12 | Control Plane | Proxmox-02 |
| talos-node-gpu-1 | 10.100.0.53 | 10.150.0.13 | Worker (GPU) | Proxmox-03 |
| talos-node-large-1 | 10.100.0.54 | 10.150.0.14 | Worker | Proxmox-03 |
| talos-node-large-2 | 10.100.0.55 | 10.150.0.15 | Worker | Proxmox-03 |
| talos-node-large-3 | 10.100.0.56 | 10.150.0.16 | Worker | Proxmox-03 |

Kubernetes Cluster VIP: 10.100.0.40 (shared across control plane nodes)
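
The VIP is provided by Talos itself on the control plane nodes; in machine-config terms it is declared roughly as below (see talconfig.yaml for the authoritative version; the interface name is a placeholder):

machine:
  network:
    interfaces:
      - interface: eth0          # placeholder NIC name
        dhcp: false
        addresses:
          - 10.100.0.50/24       # the node's own VLAN 100 address
        vip:
          ip: 10.100.0.40        # shared Kubernetes API VIP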


VLAN Configuration

| VLAN ID | Network | Subnet | Gateway | MTU | Purpose | Gateway Device | Notes |
|---|---|---|---|---|---|---|---|
| 1 | LAN | 192.168.1.0/24 | 192.168.1.1 | 1500 | Management, clients | OPNsense | Default VLAN |
| 100 | SERVERS | 10.100.0.0/24 | 10.100.0.1 | 1500 | Kubernetes nodes, VMs | OPNsense | Primary server network |
| 150 | CEPH-PUBLIC | 10.150.0.0/24 | None (internal) | 9000 | Ceph client/monitor | Brocade | Jumbo frames enabled, no gateway needed |
| 200 | CEPH-CLUSTER | 10.200.0.0/24 | None (internal) | 9000 | Ceph OSD replication | Arista | Jumbo frames enabled, no gateway needed |

Kubernetes Internal Networks

| Network | CIDR | Purpose | MTU |
|---|---|---|---|
| Pod Network | 10.69.0.0/16 | Cilium pod CIDR | 1500 |
| Service Network | 10.96.0.0/16 | Kubernetes services | 1500 |

VLAN Tagging Summary

  • Tagged (Trunked): All inter-switch links, Proxmox host uplinks (for VM traffic)
  • Untagged Access Ports: Client devices on appropriate VLANs
  • [TODO: Document which VLANs are allowed on which trunk ports]

Physical Topology

High-Level Site Connectivity

graph TB
    subgraph House
        ARUBA[Aruba S2500-48p<br/>192.168.1.26]
        OPNS[OPNsense Router<br/>Gateway for VLAN 1, 100]
        MTIKHOUSE[Mikrotik NRay60<br/>192.168.1.7]
        CLIENTS[Client Devices]

        OPNS --- ARUBA
        ARUBA --- CLIENTS
        ARUBA --- MTIKHOUSE
    end

    subgraph Wireless Bridge [60GHz Wireless Bridge - 1Gbps]
        MTIKHOUSE <-.1Gbps Wireless.-> MTIKSHOP
    end

    subgraph Garage/Shop
        MTIKSHOP[Mikrotik NRay60<br/>192.168.1.8]
        BROCADE[Brocade ICX6610<br/>192.168.1.20<br/>Core/L3 Switch]
        ARISTA[Arista 7050<br/>192.168.1.21<br/>Distribution Switch]

        MTIKSHOP --- BROCADE

        PX1[Proxmox Host 1<br/>6 links]
        PX2[Proxmox Host 2<br/>6 links]
        PX3[Proxmox Host 3<br/>6 links]
        PX4[Proxmox Host 4<br/>6 links]

        BROCADE <-->|2x 40Gb QSFP+<br/>ONE DISABLED| ARISTA

        PX1 --> BROCADE
        PX2 --> BROCADE
        PX3 --> BROCADE
        PX4 --> BROCADE

        PX1 -.40Gb.-> ARISTA
        PX2 -.40Gb.-> ARISTA
        PX3 -.40Gb.-> ARISTA
        PX4 -.40Gb.-> ARISTA
    end

    style BROCADE fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style ARISTA fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
    style ARUBA fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff

Proxmox Host Connectivity Detail

Each Proxmox host has 6 network connections:

graph LR
    subgraph Proxmox Host [Single Proxmox Host - 6 Links]
        IPMI[IPMI/BMC<br/>1Gb NIC]
        MGMT[Management Bond<br/>2x 1Gb]
        VM[VM Bridge Bond<br/>2x 10Gb]
        CEPH[Ceph Storage<br/>1x 40Gb]
    end

    subgraph Brocade ICX6610
        B1[Port: 1Gb]
        B2[LAG: 2x 1Gb]
        B3[LAG: 2x 10Gb]
    end

    subgraph Arista 7050
        A1[Port: 40Gb]
    end

    IPMI -->|Standalone| B1
    MGMT -->|LACP Bond| B2
    VM -->|LACP Bond<br/>VLAN 100, 150| B3
    CEPH -->|Standalone<br/>VLAN 200| A1

    style Brocade fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style Arista fill:#389826,stroke:#fff,stroke-width:2px,color:#fff

Per-Host Link Summary:

  • IPMI: 1x 1Gb to Brocade (dedicated management)
  • Proxmox Management: 2x 1Gb LACP bond to Brocade (Proxmox host IP)
  • VM Traffic: 2x 10Gb LACP bond to Brocade (bridges for VMs, VLAN 100, 150)
  • Ceph Cluster: 1x 40Gb to Arista (VLAN 200 only)

Total Bandwidth per Host:

  • To Brocade: 23 Gbps (3 + 20 Gbps)
  • To Arista: 40 Gbps

Brocade-Arista Interconnect (ISSUE)

graph LR
    subgraph Brocade ICX6610
        BP1[QSFP+ Port 1<br/>40Gb]
        BP2[QSFP+ Port 2<br/>40Gb]
    end

    subgraph Arista 7050
        AP1[QSFP+ Port 1<br/>40Gb<br/>ACTIVE]
        AP2[QSFP+ Port 2<br/>40Gb<br/>DISABLED]
    end

    BP1 ---|Currently: Simple Trunk| AP1
    BP2 -.-|DISABLED to prevent loop| AP2

    style AP2 fill:#ff0000,stroke:#fff,stroke-width:2px,color:#fff

CURRENT ISSUE:

  • 2x 40Gb links are configured as separate trunk ports (default VLAN 1, passing all VLANs)
  • This creates a layer 2 loop
  • ONE port disabled on Arista side as workaround
  • SOLUTION NEEDED: Configure proper LACP/port-channel on both switches

[TODO: Document target LAG configuration]


Logical Topology

Layer 2 VLAN Distribution

graph TB
    subgraph Layer 2 VLANs
        V1[VLAN 1: Management<br/>192.168.1.0/24]
        V100[VLAN 100: Servers<br/>10.100.0.0/24]
        V150[VLAN 150: Ceph Public<br/>10.150.0.0/24]
        V200[VLAN 200: Ceph Cluster<br/>10.200.0.0/24]
    end

    subgraph Brocade Core
        B[Brocade ICX6610<br/>L3 Gateway<br/>VLAN 150, 200]
    end

    subgraph Arista Distribution
        A[Arista 7050<br/>L2 Only]
    end

    subgraph OPNsense Router
        O[OPNsense<br/>L3 Gateway<br/>VLAN 1, 100]
    end

    V1 --> B
    V100 --> B
    V150 --> B
    V200 --> A

    B -->|Routes to| O
    A --> B

    style B fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style O fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff

Layer 3 Routing

Primary Gateways:

  • OPNsense (at house):

    • VLAN 1: 192.168.1.1
    • VLAN 100: 10.100.0.1
    • Default gateway for internet access
    • 2.5 Gbps AT&T fiber
  • Brocade ICX6610 (at garage):

    • VLAN 1: 192.168.1.20
    • VLAN 100: 10.100.0.10
    • VLAN 150: None
    • VLAN 200: None
    • VIPs that route back to the gateway at 192.168.1.1 or 10.100.0.1

[TODO: Document inter-VLAN routing rules]

  • Can VLAN 150/200 reach the internet? USER-TODO: Need to check; not confirmed
  • Are there firewall rules blocking inter-VLAN traffic?
  • How does Ceph traffic route if needed?

Traffic Flows

VLAN 1 (Management) - 192.168.1.0/24

Purpose: Switch management, IPMI, admin access

Flow:

Client (House)
  → Aruba Switch
  → Wireless Bridge (1Gbps)
  → Brocade
  → Switch/IPMI management interface

Devices:

  • All switch management IPs
  • Proxmox IPMI interfaces
  • Admin workstations
  • [TODO: Complete device list]

VLAN 100 (Servers) - 10.100.0.0/24

Purpose: Kubernetes nodes, VM primary network

Flow:

Talos Node (10.100.0.50-56)
  → Proxmox VM Bridge (10Gb bond)
  → Brocade (2x 10Gb bond)
  → Routes through OPNsense for internet

Key Services:

  • Kubernetes API: 10.100.0.40:6443 (VIP)
  • Talos nodes: 10.100.0.50-56

Internet Access: Yes (via OPNsense gateway)


VLAN 150 (Ceph Public) - 10.150.0.0/24

Purpose: Ceph client connections, monitor communication, CSI drivers

MTU: 9000 (Jumbo frames)

Flow:

Kubernetes Pod (needs storage)
  → Rook CSI Driver
  → Talos Node (10.150.0.10-16)
  → Proxmox VM Bridge (10Gb bond)
  → Brocade (2x 10Gb bond)
  → Ceph Monitors on Proxmox hosts

Key Services:

  • Ceph Monitors: 10.150.0.4, 10.150.0.2
  • Kubernetes nodes: 10.150.0.10-16 (secondary IPs)
  • Rook CSI drivers connect via this network

Gateway: None required (internal-only network)

Internet Access: Not needed (Ceph storage network)

Performance:

  • 2x 10Gb bonded links per host
  • Jumbo frames (MTU 9000)
  • Shared with VLAN 100 on same physical bond
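
On the Talos side this only requires giving the storage interface its VLAN 150 address and MTU 9000; a hedged machine-config fragment (interface name is a placeholder, the real settings live in the machine-network patch):

machine:
  network:
    interfaces:
      - interface: eth1          # placeholder: NIC carrying VLAN 150
        dhcp: false
        mtu: 9000                # jumbo frames, must match the switch ports
        addresses:
          - 10.150.0.10/24       # storage IP from the node table above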

VLAN 200 (Ceph Cluster) - 10.200.0.0/24

Purpose: Ceph OSD replication, cluster heartbeat (backend traffic)

MTU: 9000 (Jumbo frames)

Flow:

Ceph OSD on Proxmox Host 1
  → 40Gb link to Arista
  → Arista 7050 (switch fabric)
  → 40Gb link to Proxmox Host 2-4
  → Ceph OSD on other hosts

Key Characteristics:

  • Dedicated high-speed path: Uses 40Gb links exclusively
  • East-west traffic only: OSD-to-OSD replication
  • Does NOT traverse Brocade for data path
  • Arista provides switching for this VLAN

Gateway: None required (internal-only network)

Internet Access: Not needed (Ceph backend replication only)

Performance:

  • 40Gbps per host to Arista
  • Dedicated bandwidth (not shared with other traffic)
  • Jumbo frames critical for large object transfers

[TODO: Document Proxmox host IPs on this VLAN] USER-TODO: Need to choose/configure


Traffic Segregation Summary

| VLAN | Physical Path | Bandwidth | MTU | Shared? |
|---|---|---|---|---|
| 1 (Management) | 1Gb/10Gb to Brocade | Shared | 1500 | Yes |
| 100 (Servers) | 2x 10Gb bond to Brocade | 20 Gbps | 1500 | Yes (with VLAN 150) |
| 150 (Ceph Public) | 2x 10Gb bond to Brocade | 20 Gbps | 9000 | Yes (with VLAN 100) |
| 200 (Ceph Cluster) | 1x 40Gb to Arista | 40 Gbps | 9000 | No (dedicated) |

Switch Configuration

Brocade ICX6610 Configuration

Role: Core L3 switch, VLAN routing for 150/200

Port Assignments:

[TODO: Document port assignments]

Example:
- Ports 1/1/1-4: IPMI connections (VLAN 1 untagged)
- Ports 1/1/5-12: Proxmox management bonds (LAG groups)
- Ports 1/1/13-20: Proxmox 10Gb bonds (LAG groups, trunk VLAN 100, 150)
- Ports 1/1/41-42: 40Gb to Arista (LAG group, trunk all VLANs)
- Port 1/1/48: Uplink to Mikrotik (trunk all VLANs)

VLAN Interfaces (SVI):

[TODO: Brocade config snippet]

interface ve 1
  ip address 192.168.1.20/24

interface ve 150
  ip address [TODO]/24
  mtu 9000

interface ve 200
  ip address [TODO]/24
  mtu 9000

Static Routes:

[TODO: Document static routes to OPNsense]

Arista 7050 Configuration

Role: High-speed distribution for VLAN 200 (Ceph cluster)

Port Assignments:

[TODO: Document port assignments]

Example:
- Ports Et1-4: Proxmox 40Gb links (VLAN 200 tagged)
- Ports Et49-50: 40Gb to Brocade (port-channel, trunk all VLANs)

Configuration:

[TODO: Arista config snippet for port-channel]

Aruba S2500-48p Configuration

Role: Access switch at house

Uplink: Via Mikrotik wireless bridge to garage

[TODO: Document VLAN configuration and port assignments]

Common Configuration Tasks

Fix Brocade-Arista LAG Issue

Current State: One 40Gb link disabled to prevent loop

Target State: Both 40Gb links in LACP port-channel

Brocade Configuration:

[TODO: Brocade LACP config]

lag "brocade-to-arista" dynamic id [lag-id]
  ports ethernet 1/1/41 to 1/1/42
  primary-port 1/1/41
  deploy

interface ethernet 1/1/41
  link-aggregate active

interface ethernet 1/1/42
  link-aggregate active

Arista Configuration:

[TODO: Arista LACP config]

interface Port-Channel1
  description Link to Brocade ICX6610
  switchport mode trunk
  switchport trunk allowed vlan 1,100,150,200

interface Ethernet49
  channel-group 1 mode active
  description Link to Brocade 40G-1

interface Ethernet50
  channel-group 1 mode active
  description Link to Brocade 40G-2

Performance Characteristics

Bandwidth Allocation

Total Uplink Capacity (Garage to House):

  • 1 Gbps (Mikrotik 60GHz bridge)
  • Bottleneck: All VLAN 1 and internet-bound traffic limited to 1Gbps

Garage Internal Bandwidth:

  • Brocade to Hosts: 92 Gbps aggregate (12x 1Gb + 8x 10Gb bonds)
  • Arista to Hosts: 160 Gbps (4x 40Gb)
  • Brocade-Arista: 40 Gbps (when LAG working: 80 Gbps)

Expected Traffic Patterns

High Bandwidth Flows:

  1. Ceph OSD replication (VLAN 200) - 40Gb per host
  2. Ceph client I/O (VLAN 150) - 20Gb shared per host
  3. VM network traffic (VLAN 100) - 20Gb shared per host

Constrained Flows:

  1. Internet access - limited to 1Gbps wireless bridge
  2. Management traffic - shared 1Gbps wireless bridge

Troubleshooting Reference

Connectivity Testing

Test Management Access:

# From any client
ping 192.168.1.20  # Brocade
ping 192.168.1.21  # Arista
ping 192.168.1.26  # Aruba

# Test across wireless bridge
ping 192.168.1.7   # Mikrotik House
ping 192.168.1.8   # Mikrotik Shop

Test VLAN 100 (Servers):

ping 10.100.0.40   # Kubernetes VIP
ping 10.100.0.50   # Talos control-1

Test VLAN 150 (Ceph Public):

ping 10.150.0.10   # Talos control-1 storage interface

Check LAG Status

Brocade:

show lag
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Arista:

show port-channel summary
show interface ethernet 49
show interface ethernet 50

Monitor Traffic

Brocade:

show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate

Check VLAN configuration:

show vlan
show interface brief

Known Issues and Gotchas

Active Issues

  1. Brocade-Arista Interconnect Loop

    • Symptom: Network storms, high CPU on switches, connectivity issues
    • Current Workaround: One 40Gb link disabled on Arista side
    • Root Cause: Links configured as separate trunks instead of LAG
    • Solution: Configure LACP/port-channel on both switches (see above)
  2. [TODO: Document other known issues]

Design Considerations

  1. Wireless Bridge Bottleneck

    • All internet traffic and house-to-garage limited to 1Gbps
    • Management access during wireless outage is difficult
    • Consider: OOB management network or local crash cart access
  2. Single Point of Failure

    • Wireless bridge failure isolates garage from house
    • Brocade failure loses routing for VLAN 150/200
    • Consider: Redundancy strategy
  3. VLAN 200 Routing

    • If gateway is on Brocade but traffic flows through Arista, need to verify routing
    • Confirm: Does VLAN 200 need a gateway at all? (internal only)

Future Improvements

[TODO: Document planned network changes]

  • Fix Brocade-Arista LAG to enable second 40Gb link
  • Document complete port assignments for all switches
  • Add network monitoring/observability (Prometheus exporters?)
  • Consider redundant wireless link or fiber between buildings
  • Implement proper change management for switch configs
  • [TODO: Add your planned improvements]

Change Log

| Date | Change | Person | Notes |
|---|---|---|---|
| 2025-10-14 | Initial documentation created | Claude | Baseline network topology documentation |
| [TODO] | [TODO] | [TODO] | [TODO] |

References

  • Talos Configuration: kubernetes/bootstrap/talos/talconfig.yaml
  • Network Patches: kubernetes/bootstrap/talos/patches/global/machine-network.yaml
  • Kubernetes Network: See docs/src/architecture/network.md for application-level networking
  • Storage Network: See docs/src/architecture/storage.md for Ceph network details

Storage Architecture

Storage Overview

The Dapper Cluster uses Rook Ceph as its primary storage solution, providing unified storage for all Kubernetes workloads. The external Ceph cluster runs on Proxmox hosts and is connected to the Kubernetes cluster via Rook's external cluster mode.

graph TD
    subgraph External Ceph Cluster
        MON[Ceph Monitors]
        OSD[Ceph OSDs]
        MDS[Ceph MDS - CephFS]
    end

    subgraph Kubernetes Cluster
        ROOK[Rook Operator]
        CSI[Ceph CSI Drivers]
        SC[Storage Classes]
    end

    subgraph Applications
        APPS[Application Pods]
        PVC[Persistent Volume Claims]
    end

    MON --> ROOK
    ROOK --> CSI
    CSI --> SC
    SC --> PVC
    PVC --> APPS
    CSI --> MDS
    CSI --> OSD

Storage Architecture Decision

Why Rook Ceph?

The cluster migrated from OpenEBS Mayastor and various NFS backends to Rook Ceph for several key reasons:

  1. Unified Storage Platform: Single storage solution for all workload types
  2. External Cluster Design: Leverages existing Proxmox Ceph cluster infrastructure
  3. High Performance: Direct Ceph integration without NFS overhead
  4. Scalability: Native Ceph scalability for growing storage needs
  5. Feature Rich: Snapshots, cloning, expansion, and advanced storage features
  6. ReadWriteMany Support: CephFS provides shared filesystem access
  7. Production Proven: Mature, widely-adopted storage solution

Migration History

  • Previous: OpenEBS Mayastor (block storage) + Unraid NFS backends (shared storage)
  • Current: Rook Ceph with CephFS and RBD (unified storage platform)
  • In Progress: Decommissioning Unraid servers (tower/tower-2) in favor of Ceph

Current Storage Classes

CephFS Shared Storage (Default)

Storage Class: cephfs-shared

Primary storage class for all workloads requiring dynamic provisioning.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate

Characteristics:

  • Access Mode: ReadWriteMany (RWX) - Multiple pods can read/write simultaneously
  • Use Cases:
    • Applications requiring shared storage
    • Media applications
    • Backup repositories (VolSync)
    • Configuration storage
    • General application storage
  • Performance: Good performance for most workloads, shared filesystem overhead
  • Default: Yes - all PVCs without explicit storageClassName use this

CephFS Static Storage

Storage Class: cephfs-static

Used for pre-existing CephFS paths that need to be mounted into Kubernetes.

Characteristics:

  • Access Mode: ReadWriteMany (RWX)
  • Use Cases:
    • Mounting existing data directories (e.g., /truenas/* paths)
    • Large media libraries
    • Shared configuration repositories
    • Data migration scenarios
  • Provisioning: Manual - requires creating both PV and PVC
  • Pattern: See "Static PV Pattern" section below

Example: Media storage at /truenas/media

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media

RBD Block Storage

Storage Classes: ceph-rbd, ceph-bulk

High-performance block storage using Ceph RADOS Block Devices.

Characteristics:

  • Access Mode: ReadWriteOnce (RWO) - Single pod exclusive access
  • Performance: Superior to CephFS for block workloads (databases, etc.)
  • Thin Provisioning: Efficient storage allocation
  • Features: Snapshots, clones, fast resizing

Use Cases:

  • PostgreSQL and other databases
  • Stateful applications requiring block storage
  • Applications needing high IOPS
  • Workloads migrating from OpenEBS Mayastor

Storage Classes:

  • ceph-rbd: General-purpose RBD storage
  • ceph-bulk: Erasure-coded pool for large, less-critical data
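
For reference, an RBD storage class in a Rook external-cluster setup looks roughly like the sketch below; the pool name is a placeholder and the CSI secret parameters are omitted, so treat this as the general shape rather than the cluster's exact manifest:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: kubernetes                    # placeholder RBD pool name
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  # provisioner/node-stage secret parameters omitted for brevity
allowVolumeExpansion: true
reclaimPolicy: Delete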

Legacy Unraid NFS Storage (Being Decommissioned)

Storage Class: used-nfs (the static tower/tower-2 PVs themselves have no storage class)

Legacy NFS storage from Unraid servers, currently being migrated to Ceph.

Servers:

  • tower.manor - Primary Unraid server (100Ti NFS) - Decommissioning
  • tower-2.manor - Secondary Unraid server (100Ti NFS) - Decommissioning

Current Status:

  • Some media applications still use hybrid approach during migration
  • Active data migration to CephFS in progress
  • Will be fully retired once migration complete

Migration Plan: All workloads being moved to Ceph (CephFS or RBD as appropriate)

Storage Provisioning Patterns

Dynamic Provisioning (Default)

For most applications, simply create a PVC and Kubernetes will automatically provision storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  # No storageClassName specified = uses default (cephfs-shared)

Static PV Pattern

For mounting pre-existing CephFS paths:

Step 1: Create PersistentVolume

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-static-pv
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/my-data  # Pre-existing path in CephFS

Step 2: Create matching PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-static-pvc
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Ti
  storageClassName: cephfs-static
  volumeName: my-static-pv

Current Static PVs in Use:

  • media-cephfs-pv → /truenas/media (100Ti)
  • minio-cephfs-pv → /truenas/minio (10Ti)
  • paperless-cephfs-pv → /truenas/paperless (5Ti)

Storage Decision Matrix

| Workload Type | Storage Class | Access Mode | Rationale |
|---|---|---|---|
| Databases (PostgreSQL, etc.) | ceph-rbd | RWO | Best performance for block storage workloads |
| Media Libraries | cephfs-static or cephfs-shared | RWX | Shared access for media servers |
| Media Downloads | cephfs-shared | RWX | Multi-pod write access |
| Application Data (single pod) | ceph-rbd | RWO | High performance block storage |
| Application Data (multi-pod) | cephfs-shared | RWX | Concurrent access required |
| Backup Repositories | cephfs-shared | RWX | VolSync requires RWX |
| Shared Config | cephfs-shared | RWX | Multiple pods need access |
| Bulk Storage | ceph-bulk or cephfs-static | RWO/RWX | Large datasets, erasure coding |
| Legacy Apps (during migration) | used-nfs | RWX | Temporary until Unraid decommissioning is complete |

Backup Strategy

VolSync with CephFS

All persistent data is backed up using VolSync, which now uses CephFS for its repository storage:

  • Backup Frequency: Hourly snapshots via ReplicationSource
  • Repository Storage: CephFS PVC (migrated from NFS)
  • Backend: Restic repositories on CephFS
  • Retention: Configurable per-application
  • Recovery: Supports restore to same or different PVC

VolSync Repository Location: /repository/{APP} on CephFS
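
Restores are driven by a ReplicationDestination that points at the same Restic repository as the application's ReplicationSource; a hedged example (names and target PVC are placeholders):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore                 # placeholder
  namespace: my-namespace              # placeholder
spec:
  trigger:
    manual: restore-once
  restic:
    repository: my-app-restic-secret   # same secret as the ReplicationSource
    copyMethod: Direct
    destinationPVC: my-app-data        # restore straight into the app's PVC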

Network Configuration

Ceph Networks

The external Ceph cluster uses two networks:

  • Public Network: 10.150.0.0/24
    • Client connections from Kubernetes
    • Ceph monitor communication
    • Used by CSI drivers
  • Cluster Network: 10.200.0.0/24
    • OSD-to-OSD replication
    • Not directly accessed by Kubernetes

Connection Method

Kubernetes connects to Ceph via:

  1. Rook Operator: Manages connection to external cluster
  2. CSI Drivers: cephfs.csi.ceph.com for CephFS volumes
  3. Mon Endpoints: ConfigMap with Ceph monitor addresses
  4. Authentication: Ceph client.kubernetes credentials

Performance Characteristics

CephFS Performance

  • Sequential Read: Excellent (limited by network, ~10 Gbps)
  • Sequential Write: Very Good (COW overhead, CRUSH rebalancing)
  • Random I/O: Good (shared filesystem overhead)
  • Concurrent Access: Excellent (native RWX support)
  • Metadata Operations: Good (dedicated MDS servers)

Optimization Tips

  1. Use RWO when possible: Even on CephFS, specify RWO if no sharing needed
  2. Size appropriately: CephFS handles small and large files well
  3. Monitor MDS health: CephFS performance depends on MDS responsiveness
  4. Enable client caching: Default CSI settings enable attribute caching

Storage Operations

Common Operations

Expand a PVC:

kubectl patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Check storage usage:

kubectl get pvc -A
kubectl exec -it <pod> -- df -h

Monitor Ceph cluster health:

kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

List CephFS mounts:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

Troubleshooting

PVC stuck in Pending:

kubectl describe pvc <pvc-name>
kubectl -n rook-ceph logs -l app=rook-ceph-operator

Slow performance:

# Check Ceph cluster health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail

# Check MDS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

Mount issues:

# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin

# Verify connection to monitors
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml

Current Migration Status

Completed

  • ✅ RBD storage classes implemented and available
  • ✅ CephFS as default storage class
  • ✅ VolSync migrated to CephFS backend
  • ✅ Static PV pattern established for existing data
  • ✅ Migrated from OpenEBS Mayastor to Ceph RBD

In Progress

  • 🔄 Decommissioning Unraid NFS servers (tower/tower-2)
  • 🔄 Migrating remaining media workloads from NFS to CephFS
  • 🔄 Consolidating all storage onto Ceph platform

Future Enhancements

  • 📋 Additional RBD pool with SSD backing for critical workloads
  • 📋 Erasure coding optimization for bulk media storage
  • 📋 Advanced snapshot scheduling and retention policies
  • 📋 Ceph performance tuning and optimization

Best Practices

Storage Selection

  1. Databases and single-pod apps: Use ceph-rbd for best performance
  2. Shared storage needs: Use cephfs-shared for RWX access
  3. Use static PVs for existing data: Don't duplicate large datasets
  4. Specify requests accurately: Helps with capacity planning
  5. Choose appropriate access modes: RWO for RBD, RWX for CephFS

Capacity Planning

  1. Monitor Ceph cluster capacity:
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
    
  2. Set appropriate PVC sizes: CephFS supports expansion
  3. Plan for growth: Ceph cluster can scale by adding OSDs
  4. Regular capacity reviews: Check usage trends

Data Protection

  1. Enable VolSync: For all stateful applications
  2. Test restores regularly: Ensure backup viability
  3. Monitor backup success: Check ReplicationSource status
  4. Retain snapshots appropriately: Balance storage cost vs recovery needs

Security

  1. Use namespace isolation: PVCs are namespace-scoped
  2. Limit access with RBAC: Control who can create PVCs
  3. Monitor access patterns: Unusual I/O may indicate issues
  4. Rotate Ceph credentials: Periodically update client keys

Monitoring and Observability

Key Metrics

Monitor these metrics via Prometheus/Grafana:

  • Ceph cluster health status
  • OSD utilization and performance
  • MDS cache hit rates
  • PVC capacity usage
  • CSI operation latencies
  • VolSync backup success rates

Alerts

Critical alerts configured:

  • Ceph cluster health warnings
  • High OSD utilization (>80%)
  • MDS performance degradation
  • PVC approaching capacity
  • VolSync backup failures

References

  • Rook Documentation: rook.io/docs
  • Ceph Documentation: docs.ceph.com
  • Local Setup: kubernetes/apps/rook-ceph/README.md
  • Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml

Media Applications

Media Stack Overview

The media stack provides automated media management and streaming services using the *arr suite of applications and Plex Media Server.

graph TD
    subgraph Content Acquisition
        SONARR[Sonarr - TV Shows]
        SONARR_UHD[Sonarr UHD - 4K TV]
        RADARR[Radarr - Movies]
        RADARR_UHD[Radarr UHD - 4K Movies]
        BAZARR[Bazarr - Subtitles]
        BAZARR_UHD[Bazarr UHD - 4K Subtitles]
    end

    subgraph Download Clients
        SABNZBD[SABnzbd - Usenet]
        NZBGET[NZBget - Usenet Alt]
    end

    subgraph Media Storage
        CEPHFS[CephFS Static PV<br>/truenas/media]
        TOWER[Tower NFS<br>/mnt/user]
        TOWER2[Tower-2 NFS<br>/mnt/user]
    end

    subgraph Media Server
        PLEX[Plex Media Server]
        TAUTULLI[Tautulli - Analytics]
        OVERSEERR[Overseerr - Requests]
    end

    subgraph Post Processing
        TDARR[Tdarr - Transcoding]
        KOMETA[Kometa - Metadata]
    end

    SONARR --> SABNZBD
    RADARR --> SABNZBD
    SABNZBD --> CEPHFS
    SABNZBD --> TOWER
    SABNZBD --> TOWER2
    CEPHFS --> PLEX
    TOWER --> PLEX
    TOWER2 --> PLEX
    PLEX --> TAUTULLI
    OVERSEERR --> SONARR
    OVERSEERR --> RADARR
    TDARR --> CEPHFS
    KOMETA --> PLEX

Storage Configuration

Primary Media Library - CephFS

The main media library is stored on CephFS using a static PV that mounts the pre-existing /truenas/media directory.

Configuration: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media

Mount Pattern: Applications mount this PVC at /media or specific subdirectories:

  • /media/downloads - Download staging area
  • /media/tv - TV show library
  • /media/movies - Movie library
  • /media/music - Music library
  • /media/books - Book library

Benefits:

  • ReadWriteMany: Multiple pods can access simultaneously
  • High Performance: Direct CephFS access, no NFS overhead
  • Shared Access: All media apps see the same filesystem
  • Snapshots: VolSync backups protect the data

Legacy NFS Mounts (Unraid)

Download clients and some media applications use legacy NFS mounts from Unraid servers alongside CephFS.

Servers:

  • tower.manor - Primary Unraid server (100Ti NFS)
  • tower-2.manor - Secondary Unraid server (100Ti NFS)

Current Usage:

  • SABnzbd downloads to all three storage backends (CephFS, tower, tower-2)
  • Plex reads media from all three storage backends
  • Active downloads and in-progress media on Unraid
  • Organized/completed media on CephFS

Status: Legacy - gradual migration to CephFS in progress

Configuration: Static PVs without storage class

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-tower-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: tower.manor
    path: /mnt/user
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-tower-2-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: tower-2.manor
    path: /mnt/user

Core Components

Media Server

Plex Media Server

Namespace: media
Purpose: Media streaming and library management

Plex is the primary media server, providing:

  • Streaming to multiple devices
  • Hardware transcoding (Intel Quick Sync)
  • Library organization and metadata
  • User management and sharing
  • Remote access

Configuration: kubernetes/apps/media/plex/app/helmrelease.yaml

Storage Mounts (Multi-backend):

  • Media library (CephFS): CephFS static PV at /media
  • Media library (Tower): Tower NFS at /tower
  • Media library (Tower-2): Tower-2 NFS at /tower-2
  • Configuration: CephFS dynamic PVC (10Gi)
  • Transcoding cache: EmptyDir (temporary)

Library Configuration:

  • Plex libraries configured to scan all three storage backends
  • Unified library view across CephFS and Unraid storage

Resource Allocation:

resources:
  requests:
    cpu: 2000m
    memory: 4Gi
    gpu.intel.com/i915: 1
  limits:
    cpu: 8000m
    memory: 16Gi
    gpu.intel.com/i915: 1

Hardware Acceleration: Intel Quick Sync enabled for transcoding

Tautulli

Namespace: media
Purpose: Plex analytics and monitoring

Provides:

  • Watch history and statistics
  • User activity monitoring
  • Notification triggers
  • API for automation

Storage: CephFS dynamic PVC (5Gi) for database and logs

Content Acquisition (*arr Suite)

Sonarr / Sonarr UHD

Purpose: TV show automation

  • Sonarr: Standard quality TV shows
  • Sonarr UHD: 4K/UHD TV shows

Features:

  • TV series tracking and monitoring
  • Episode search and download
  • Quality profiles and upgrades
  • Calendar and schedule tracking

Storage:

  • Configuration: CephFS dynamic PVC (10Gi)
  • Media access: CephFS static PV (shared /media)

Radarr / Radarr UHD

Purpose: Movie automation

  • Radarr: Standard quality movies
  • Radarr UHD: 4K/UHD movies

Features:

  • Movie library management
  • Automated downloads
  • Quality management
  • List integration (IMDb, Trakt)

Storage:

  • Configuration: CephFS dynamic PVC (10Gi)
  • Media access: CephFS static PV (shared /media)

Bazarr / Bazarr UHD

Purpose: Subtitle management

Automated subtitle downloading for:

  • TV shows (via Sonarr integration)
  • Movies (via Radarr integration)
  • Multiple languages
  • Subtitle providers

Storage: CephFS dynamic PVC (5Gi)

Download Clients

SABnzbd

Namespace: media
Purpose: Primary Usenet download client

Features:

  • NZB file processing
  • Automated post-processing
  • Category-based handling
  • Integration with *arr apps

Storage Mounts (Multi-backend):

  • Configuration: CephFS dynamic PVC (5Gi)
  • Downloads (CephFS): CephFS static PV /media/downloads/usenet
  • Downloads (Tower): Tower NFS /tower/downloads/usenet
  • Downloads (Tower-2): Tower-2 NFS /tower-2/downloads/usenet
  • Incomplete: CephFS dynamic PVC (temporary downloads)

Download Strategy:

  • Categories route to different storage backends
  • Active downloads use appropriate backend based on category
  • Completed downloads moved to final library location

Post-Processing: Automatically moves completed downloads to appropriate media folders

NZBget

Namespace: media
Purpose: Alternative Usenet client

Lightweight alternative to SABnzbd for specific use cases.

Storage: Similar pattern to SABnzbd

Post-Processing

Tdarr

Purpose: Media transcoding and file optimization

Components:

  1. Tdarr Server: Manages transcoding queue
  2. Tdarr Node: CPU-based transcoding workers
  3. Tdarr Node GPU: GPU-accelerated transcoding

Use Cases:

  • Convert media to h265/HEVC
  • Reduce file sizes
  • Standardize formats
  • Remove unwanted audio/subtitle tracks

Storage:

  • Configuration: CephFS dynamic PVC (25Gi)
  • Media access: CephFS static PV (shared /media)
  • Transcode cache: CephFS dynamic PVC (100Gi)

Resource Intensive: Uses significant CPU/GPU resources during transcoding

Kometa (formerly Plex Meta Manager)

Purpose: Enhanced Plex metadata and collections

Features:

  • Automated collections (e.g., "Top Rated 2023")
  • Poster and artwork management
  • Rating and tag synchronization
  • Scheduled metadata updates

Storage: CephFS dynamic PVC (5Gi) for configuration

User Management

Overseerr

Namespace: media
Purpose: Media request management

User-facing application for:

  • Media requests (movies/TV shows)
  • Request approval workflow
  • User quotas and limits
  • Integration with Sonarr/Radarr

Authentication: Integrated with Plex accounts

Storage: CephFS dynamic PVC (5Gi)

Network Configuration

Internal Access

All media applications are accessible via internal DNS:

spec:
  ingressClassName: internal
  hosts:
    - host: plex.chelonianlabs.com
      paths:
        - path: /
          pathType: Prefix

External Access

Plex is accessible externally via:

  • Cloudflared tunnel for secure access
  • Direct access on port 32400 (firewall controlled)

Service Discovery

Applications discover each other via Kubernetes services:

  • sonarr.media.svc.cluster.local:8989
  • radarr.media.svc.cluster.local:7878
  • sabnzbd.media.svc.cluster.local:8080
  • plex.media.svc.cluster.local:32400

Backup Strategy

Application Configuration

All *arr application configurations are backed up via VolSync:

Backup Schedule: Hourly
Retention:

  • Hourly: 24 snapshots
  • Daily: 7 snapshots
  • Weekly: 4 snapshots

Backup Pattern:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr
  namespace: media
spec:
  sourcePVC: sonarr-config
  trigger:
    schedule: "0 * * * *"
  restic:
    repository: sonarr-restic-secret
    retain:
      hourly: 24
      daily: 7
      weekly: 4

Media Library

Media files are NOT backed up via VolSync due to size (100Ti+)

Protection Strategy:

  • Ceph replication (3x copies across OSDs)
  • Replaceable content (can be re-downloaded)
  • Critical media manually backed up externally

Configuration Backup: All *arr databases and settings are backed up

Resource Management

Resource Allocation Strategy

Media applications have varying resource needs:

High Resource:

  • Plex: 2-8 CPU, 4-16Gi RAM, GPU for transcoding
  • Tdarr: 4-16 CPU, 8-32Gi RAM, GPU optional

Medium Resource:

  • Sonarr/Radarr: 500m-2 CPU, 512Mi-2Gi RAM
  • SABnzbd: 1-4 CPU, 1-4Gi RAM

Low Resource:

  • Bazarr: 100m-500m CPU, 128Mi-512Mi RAM
  • Overseerr: 100m-500m CPU, 256Mi-1Gi RAM

Storage Quotas

Dynamic PVCs sized appropriately:

  • Configuration: 5-10Gi (databases, logs)
  • Download buffers: 100Gi (temporary downloads)
  • Transcode cache: 100Gi (Tdarr working space)

Maintenance

Regular Tasks

Weekly:

  • Review failed downloads
  • Check disk space usage
  • Verify backup completion
  • Update metadata (Kometa)

Monthly:

  • Library maintenance (Plex)
  • Database optimization (*arr apps)
  • Review and cleanup old downloads
  • Check for application updates (Renovate handles this)

Health Monitoring

Key Metrics:

  • Plex stream count and transcoding sessions
  • SABnzbd download queue and speed
  • *arr indexer health and search failures
  • Storage capacity and growth rate

Alerts:

  • Download failures
  • Indexer connectivity issues
  • Storage capacity warnings
  • Failed backup jobs

Troubleshooting

Common Issues

Plex can't see media files:

# Check PVC mount
kubectl exec -n media deployment/plex -- ls -la /media

# Verify permissions
kubectl exec -n media deployment/plex -- ls -ld /media/movies /media/tv

# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Downloads not moving to library:

# Check SABnzbd logs
kubectl logs -n media deployment/sabnzbd --tail=100

# Verify shared storage access
kubectl exec -n media deployment/sabnzbd -- ls -la /media/downloads/usenet

# Check Sonarr/Radarr import
kubectl logs -n media deployment/sonarr --tail=100 | grep -i import

Slow transcoding:

# Verify GPU allocation
kubectl describe pod -n media -l app.kubernetes.io/name=plex | grep -A5 "Limits\|Requests"

# Check GPU utilization (on node)
intel_gpu_top

# Review transcode logs
kubectl logs -n media deployment/plex | grep -i transcode

Storage full:

# Check PVC usage
kubectl get pvc -n media

# Check storage usage in pod
kubectl exec -n media deployment/plex -- df -h | grep media

# Check Ceph cluster capacity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

Best Practices

Storage Organization

Directory Structure:

/media/
├── downloads/
│   ├── usenet/          # SABnzbd downloads
│   └── complete/        # Completed downloads
├── movies/              # Radarr managed
│   ├── 4k/             # UHD content
│   └── 1080p/          # HD content
├── tv/                  # Sonarr managed
│   ├── 4k/
│   └── 1080p/
├── music/
└── books/

Quality Profiles

  • Use separate instances for 4K content (Sonarr UHD, Radarr UHD)
  • Configure appropriate quality cutoffs
  • Enable upgrades for better releases
  • Set size limits to prevent excessive downloads

Download Management

  • Configure category-based post-processing in SABnzbd
  • Use download client categories in *arr apps
  • Enable completed download handling
  • Set appropriate retention for download history

Performance Optimization

  • Use hardware transcoding (Intel Quick Sync)
  • Pre-optimize media with Tdarr (h265/HEVC)
  • Adjust Plex transcoder quality settings
  • Enable Plex optimize versions for common devices

Security Considerations

Access Control

  • Internal Network Only: Media apps exposed only via internal ingress
  • Authentication Required: All apps require login
  • Plex Managed Auth: User access controlled via Plex sharing
  • Overseerr Integration: Request permissions via Plex accounts

API Keys

  • All API keys stored in Kubernetes secrets
  • External Secrets integration with Infisical
  • Regular key rotation via automation
  • Least privilege access between services

Future Improvements

Planned Enhancements

  • GPU Transcoding Pool: Dedicated GPU nodes for Tdarr
  • Request Automation: Auto-approve for trusted users
  • Advanced Monitoring: Grafana dashboards for media metrics
  • Content Analysis: Automated duplicate detection
  • Unraid Migration: Gradual migration of tower/tower-2 NFS storage to CephFS
    • Currently using hybrid approach (CephFS + tower + tower-2)
    • Plan: Consolidate all media storage to CephFS
    • Timeline: When Unraid servers are decommissioned

Under Consideration

  • Jellyfin: Alternative media server for comparison
  • Prowlarr: Unified indexer management
  • Readarr: Book management automation
  • Lidarr: Music management automation

References

  • Media Storage: kubernetes/apps/media/storage/
  • Plex: kubernetes/apps/media/plex/
  • Sonarr: kubernetes/apps/media/sonarr/
  • Radarr: kubernetes/apps/media/radarr/
  • SABnzbd: kubernetes/apps/media/sabnzbd/
  • Storage Architecture: docs/src/architecture/storage.md

Storage Applications

This document covers the storage-related applications and services running in the cluster.

Storage Stack Overview

graph TD
    subgraph External Infrastructure
        CEPH[Proxmox Ceph Cluster]
    end

    subgraph Kubernetes Storage Layer
        ROOK[Rook Ceph Operator]
        CSI[Ceph CSI Drivers]
        VOLSYNC[VolSync]
    end

    subgraph Storage Consumers
        APPS[Applications]
        BACKUPS[Backup Repositories]
    end

    CEPH --> ROOK
    ROOK --> CSI
    CSI --> APPS
    VOLSYNC --> BACKUPS
    CSI --> VOLSYNC

Core Components

Rook Ceph Operator

Namespace: rook-ceph
Type: Helm Release
Purpose: Manages connection to external Ceph cluster and provides CSI drivers

The Rook operator is the bridge between Kubernetes and the external Ceph cluster. It:

  • Manages CSI driver deployments
  • Maintains connection to Ceph monitors
  • Handles authentication and secrets
  • Provides CephFS filesystem access

Configuration: kubernetes/apps/rook-ceph/rook-ceph-operator/app/helmrelease.yaml

Current Setup:

  • CephFS Driver: Enabled ✅
  • RBD Driver: Disabled (Phase 2)
  • Connection Mode: External cluster
  • Network: Public network 10.150.0.0/24

Key Resources:

# Check operator status
kubectl -n rook-ceph get pods -l app=rook-ceph-operator

# View operator logs
kubectl -n rook-ceph logs -l app=rook-ceph-operator -f

# Check CephCluster resource
kubectl -n rook-ceph get cephcluster

Rook Ceph Cluster Configuration

Namespace: rook-ceph
Type: CephCluster Custom Resource
Purpose: Defines external Ceph cluster connection

Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/cluster-external.yaml

This resource tells Rook how to connect to the external Ceph cluster:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  external:
    enable: true
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: quay.io/ceph/ceph:v18

Monitor Configuration: Defined in ConfigMap rook-ceph-mon-endpoints

  • Contains Ceph monitor IP addresses
  • Critical for cluster connectivity
  • Automatically referenced by CSI drivers

Authentication: Stored in Secret rook-ceph-mon

  • Contains client.kubernetes Ceph credentials
  • Encrypted with SOPS
  • Referenced by all CSI operations
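
For orientation, the mon-endpoints ConfigMap produced by the external-cluster import has roughly this shape (monitor names and exact keys can vary by Rook version; the addresses follow the Ceph public network):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-mon-endpoints
  namespace: rook-ceph
data:
  data: "a=10.150.0.4:6789,b=10.150.0.2:6789"   # monitor name=address pairs
  mapping: "{}"
  maxMonId: "2"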

Ceph CSI Drivers

Namespace: rook-ceph
Type: DaemonSet (nodes) + Deployment (provisioner)
Purpose: Enable Kubernetes to mount CephFS volumes

Components:

  1. csi-cephfsplugin (DaemonSet)

    • Runs on every node
    • Mounts CephFS volumes to pods
    • Handles node-level operations
  2. csi-cephfsplugin-provisioner (Deployment)

    • Creates/deletes CephFS subvolumes
    • Handles dynamic provisioning
    • Manages volume expansion

Monitoring:

# Check CSI pods
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin

# View CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin

# Check provisioner
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin-provisioner

Storage Classes

Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml

cephfs-shared (Default)

Primary storage class for all dynamic provisioning:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete

Usage: Default for all PVCs without explicit storageClassName
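
For illustration, a minimal PVC that omits storageClassName and therefore lands on cephfs-shared (claim name and namespace are hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data        # hypothetical claim name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # storageClassName omitted, so the default class (cephfs-shared) is used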

cephfs-static

For mounting pre-existing CephFS directories:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-static
provisioner: rook-ceph.cephfs.csi.ceph.com
# Used with manually created PVs pointing to existing paths

Usage: Requires manual PV creation, see examples below

VolSync

Namespace: storage Type: Helm Release Purpose: Backup and recovery for Persistent Volume Claims

VolSync provides automated backup of all stateful applications using Restic.

Configuration: kubernetes/apps/storage/volsync/app/helmrelease.yaml

Backup Repository: CephFS-backed PVC

  • Location: volsync-cephfs-pvc (5Ti)
  • Path: /repository/{APP}/ for each application
  • Previous: NFS on vault.manor (migrated to CephFS)

How It Works:

  1. Applications create ReplicationSource resources
  2. VolSync creates backup pods with mover containers
  3. Mover mounts both application PVC and repository PVC
  4. Restic backs up data to repository
  5. Retention policies keep configured snapshot count

Backup Pattern:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-app
  namespace: my-namespace
spec:
  sourcePVC: my-app-data
  trigger:
    schedule: "0 * * * *"  # Hourly
  restic:
    repository: my-app-restic-secret
    retain:
      hourly: 24
      daily: 7
      weekly: 4

Common Operations:

# Manual backup trigger
task volsync:snapshot NS=<namespace> APP=<app>

# List snapshots
task volsync:run NS=<namespace> REPO=<app> -- snapshots

# Unlock repository (if locked)
task volsync:unlock-local NS=<namespace> APP=<app>

# Restore to new PVC
task volsync:restore NS=<namespace> APP=<app>

Repository PVC Configuration: kubernetes/apps/storage/volsync/app/volsync-cephfs-pv.yaml

Static PV Examples

Media Storage

Large media library mounted from pre-existing CephFS path:

Location: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-cephfs-pvc
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: cephfs-static
  volumeName: media-cephfs-pv

Usage: Mounted by Plex, Sonarr, Radarr, etc. for media library access
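
As a rough sketch (container name and mount path are illustrative, not copied from the actual app manifests), a workload mounts the claim like any other PVC:

# Illustrative pod template fragment - real manifests live under kubernetes/apps/media/
spec:
  template:
    spec:
      containers:
        - name: plex                # example container
          volumeMounts:
            - name: media
              mountPath: /media     # example mount path
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: media-cephfs-pvc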

Minio Object Storage

Minio data stored on CephFS:

Location: kubernetes/apps/storage/minio/app/minio-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio-cephfs-pv
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/minio

Paperless-ngx Document Storage

Document management system storage:

Location: kubernetes/apps/selfhosted/paperless-ngx/app/paperless-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: paperless-cephfs-pv
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/paperless

Storage Operations

Creating a New Static PV

Step 1: Create directory in CephFS (on Proxmox Ceph node)

# SSH to a Proxmox node with Ceph access
mkdir -p /mnt/cephfs/truenas/my-app
chmod 777 /mnt/cephfs/truenas/my-app  # Or appropriate permissions

Step 2: Create PV manifest

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-app-cephfs-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/my-app

Step 3: Create PVC manifest

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-cephfs-pvc
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: cephfs-static
  volumeName: my-app-cephfs-pv

Step 4: Apply and verify

kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
kubectl get pv my-app-cephfs-pv
kubectl get pvc -n my-namespace my-app-cephfs-pvc

Expanding a PVC

CephFS supports online volume expansion:

# Edit PVC to increase size
kubectl patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# Verify expansion
kubectl get pvc -n my-namespace my-pvc -w

Note: Size can only increase, not decrease

Troubleshooting Mount Issues

PVC stuck in Pending:

# Check PVC events
kubectl describe pvc -n <namespace> <pvc-name>

# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin --tail=100

# Verify storage class exists
kubectl get sc cephfs-shared

Pod can't mount volume:

# Check pod events
kubectl describe pod -n <namespace> <pod-name>

# Verify Ceph cluster connectivity
kubectl -n rook-ceph get cephcluster

# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Verify CephFS is available
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

Slow I/O performance:

# Check MDS performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

# Identify slow operations
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail

Monitoring and Alerts

Key Metrics

Monitor these via Prometheus/Grafana:

  1. Storage Capacity

    • Ceph cluster utilization
    • Individual PVC usage (example query below)
    • Growth trends
  2. Performance

    • CSI operation latency
    • MDS cache hit ratio
    • OSD I/O rates
  3. Reliability

    • VolSync backup success rate
    • Ceph health status
    • CSI driver availability
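
For example, individual PVC usage can be graphed from the kubelet volume stats, assuming the standard kubelet metrics are scraped by Prometheus:

# PVC fill ratio by namespace and claim
max by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
)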

Useful Queries

Check all PVCs by size:

kubectl get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,SIZE:.spec.resources.requests.storage,STORAGECLASS:.spec.storageClassName --sort-by=.spec.resources.requests.storage

Find PVCs using old storage classes:

kubectl get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName == "nfs-csi" or .spec.storageClassName == "mayastor-etcd-localpv") | "\(.metadata.namespace)/\(.metadata.name) - \(.spec.storageClassName)"'

Check Ceph cluster capacity:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

Monitor VolSync backups:

# Check all ReplicationSources
kubectl get replicationsource -A

# Check specific backup status
kubectl get replicationsource -n <namespace> <app> -o jsonpath='{.status.lastSyncTime}'

Backup and Recovery

VolSync Backup Workflow

  1. Application creates ReplicationSource
  2. VolSync creates backup pod (every hour by default)
  3. Restic backs up PVC to repository
  4. Snapshots retained per retention policy
  5. Status updated in ReplicationSource

Restore Procedures

Restore to original PVC:

# Scale down application
kubectl scale deployment -n <namespace> <app> --replicas=0

# Run restore
task volsync:restore NS=<namespace> APP=<app>

# Scale up application
kubectl scale deployment -n <namespace> <app> --replicas=1

Restore to new PVC:

  1. Create a ReplicationDestination pointing to the new PVC (see the sketch below)
  2. VolSync will restore data from repository
  3. Update application to use new PVC
  4. Verify data integrity
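
A minimal ReplicationDestination sketch for step 1, assuming the target PVC has been created beforehand and reusing this document's <namespace>/<app> placeholders:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore              # hypothetical name
  namespace: my-namespace
spec:
  trigger:
    manual: restore-once            # run a single restore
  restic:
    repository: my-app-restic-secret
    destinationPVC: my-app-data-new # pre-created PVC to restore into
    copyMethod: Direct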

Disaster Recovery

Complete cluster rebuild:

  1. Deploy new Kubernetes cluster
  2. Install Rook with same external Ceph connection
  3. Recreate storage classes
  4. Deploy VolSync
  5. Restore all applications from backups

CephFS corruption:

  1. Check Ceph health and repair if possible
  2. If unrecoverable, restore from VolSync backups
  3. VolSync repository is on CephFS, so ensure repository is intact
  4. Consider external backup of VolSync repository

Security Considerations

Ceph Authentication

  • Client Key: client.kubernetes Ceph user
  • Permissions: Limited to CephFS pools only
  • Storage: SOPS-encrypted in rook-ceph-mon secret
  • Rotation: Should be rotated periodically

PVC Access Control

  • Namespace Isolation: PVCs are namespace-scoped
  • RBAC: Control who can create/delete PVCs
  • Pod Security: Pods must have appropriate security context
  • Network Policies: Limit which pods can access storage

Backup Security

  • VolSync Repository: Protected by Kubernetes RBAC
  • Restic Encryption: Repository encryption with per-app keys
  • Snapshot Access: Controlled via ReplicationSource ownership

Future Enhancements (Phase 2)

RBD Block Storage

When Mayastor hardware is repurposed:

  1. Enable RBD driver in Rook operator
  2. Create RBD pools on Ceph cluster:
    • ssd-db - Critical workloads
    • rook-pvc-pool - General purpose
    • media-bulk - Erasure-coded bulk storage
  3. Deploy RBD storage classes (sketch below)
  4. Migrate workloads based on performance requirements
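
A rough sketch of what an RBD storage class could look like once Phase 2 is underway (the class name is hypothetical, and the CSI secret parameters that Rook normally templates in are omitted for brevity):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block                  # hypothetical name
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: rook-pvc-pool               # general-purpose pool from the plan above
  imageFormat: "2"
  imageFeatures: layering
allowVolumeExpansion: true
reclaimPolicy: Delete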

Planned Improvements

  • Ceph dashboard integration
  • Advanced monitoring dashboards
  • Automated capacity alerts
  • Storage QoS policies
  • Cross-cluster replication

References

  • Rook Operator: kubernetes/apps/rook-ceph/rook-ceph-operator/
  • Cluster Config: kubernetes/apps/rook-ceph/rook-ceph-cluster/
  • Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml
  • VolSync: kubernetes/apps/storage/volsync/
  • Architecture: docs/src/architecture/storage.md

Observability

Installation Guide

Prerequisites

graph TD
    subgraph Hardware
        CP[Control Plane Nodes]
        GPU[GPU Worker Node]
        Worker[Worker Nodes]
    end

    subgraph Software
        OS[Operating System]
        Tools[Required Tools]
        Network[Network Setup]
    end

    subgraph Configuration
        Git[Git Repository]
        Secrets[SOPS Setup]
        Certs[Certificates]
    end

Hardware Requirements

Control Plane Nodes (3x)

  • CPU: 4 cores per node
  • RAM: 16GB per node
  • Role: Cluster control plane

GPU Worker Node (1x)

  • CPU: 16 cores
  • RAM: 128GB
  • GPU: 4x NVIDIA Tesla P100
  • Role: GPU-accelerated workloads

Worker Nodes (2x)

  • CPU: 16 cores per node
  • RAM: 128GB per node
  • Role: General workloads

Software Prerequisites

  1. Operating System

    • Linux distribution
    • Updated system packages
    • Required kernel modules
    • NVIDIA drivers (for GPU node)
  2. Required Tools

    • kubectl
    • flux
    • SOPS
    • age/gpg
    • task

Initial Setup

1. Repository Setup

# Clone the repository
git clone https://github.com/username/dapper-cluster.git
cd dapper-cluster

# Create configuration
cp config.sample.yaml config.yaml

2. Configuration

graph LR
    Config[Configuration] --> Secrets[Secrets Management]
    Config --> Network[Network Settings]
    Config --> Storage[Storage Setup]
    Secrets --> SOPS[SOPS Encryption]
    Network --> DNS[DNS Setup]
    Storage --> CSI[CSI Drivers]

Edit Configuration

cluster:
  name: dapper-cluster
  domain: example.com

network:
  cidr: 10.0.0.0/16

storage:
  ceph:
    # External Ceph cluster connection
    # Configured via Rook operator after bootstrap
    monitors: []  # Set during Rook deployment

3. Secrets Management

  • Generate age key
  • Configure SOPS
  • Encrypt sensitive files
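
A minimal sketch of those steps, assuming age is used for encryption (the manifest path is illustrative):

# Generate an age key pair
age-keygen -o age.key

# Add the public key to .sops.yaml, then encrypt a manifest in place
sops --encrypt --in-place kubernetes/apps/example/app/secret.sops.yaml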

4. Bootstrap Process

graph TD
    Start[Start Installation] --> CP[Bootstrap Control Plane]
    CP --> Workers[Join Worker Nodes]
    Workers --> GPU[Configure GPU Node]
    GPU --> Flux[Install Flux]
    Flux --> Apps[Deploy Apps]

Bootstrap Commands

# Initialize flux
task flux:bootstrap

# Verify installation
task cluster:verify

# Verify GPU support
kubectl get nodes -o wide
nvidia-smi # on GPU node

Post-Installation

1. Verify Components

  • Check control plane health
  • Verify worker node status
  • Test GPU functionality
  • Check Rook Ceph connection
  • Verify storage classes
  • Verify network connectivity
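
These checks map to familiar commands (a non-exhaustive sketch):

# Nodes and control plane
kubectl get nodes -o wide

# Flux and GitOps reconciliation
flux check

# Rook/Ceph connection and storage classes
kubectl -n rook-ceph get cephcluster
kubectl get storageclass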

2. Deploy Applications

  • Deploy core services
  • Configure monitoring
  • Setup backup systems
  • Deploy GPU-enabled workloads

3. Security Setup

  • Configure network policies
  • Setup certificate management
  • Enable monitoring and alerts
  • Secure GPU access

Troubleshooting

Common installation issues and solutions:

  1. Control Plane Issues

    • Verify etcd cluster health
    • Check control plane components
    • Review system logs
  2. Worker Node Issues

    • Verify node join process
    • Check kubelet status
    • Review node logs
  3. GPU Node Issues

    • Verify NVIDIA driver installation
    • Check NVIDIA container runtime
    • Validate GPU visibility in cluster
  4. Storage Issues

    • Verify Ceph cluster connectivity
    • Check Rook operator status
    • Verify storage class configuration
    • Review CephCluster resource health
    • Check PV/PVC status
  5. Network Problems

    • Check DNS resolution
    • Verify network policies
    • Review ingress configuration

Maintenance

Regular Tasks

  1. System updates
  2. Certificate renewal
  3. Backup verification
  4. Security audits
  5. GPU driver updates

Health Checks

  • Component status
  • Resource usage
  • Storage capacity
  • Network connectivity
  • GPU health

Next Steps

After successful installation:

  1. Review Architecture Overview
  2. Configure Storage
  3. Setup Network
  4. Deploy Applications

Maintenance Guide

Maintenance Overview

graph TD
    subgraph Regular Tasks
        Updates[System Updates]
        Backups[Backup Tasks]
        Monitoring[Health Checks]
    end

    subgraph Periodic Tasks
        Audit[Security Audits]
        Cleanup[Resource Cleanup]
        Review[Config Review]
    end

    Updates --> Verify[Verification]
    Backups --> Test[Backup Testing]
    Monitoring --> Alert[Alert Response]

Regular Maintenance

Daily Tasks

  1. Monitor system health
    • Check cluster status
    • Review resource usage
    • Verify backup completion
    • Check alert status

Weekly Tasks

  1. Review system logs
  2. Check storage usage
  3. Verify backup integrity
  4. Update documentation

Monthly Tasks

  1. Security updates
  2. Certificate rotation
  3. Resource optimization
  4. Performance review

Update Procedures

Flux Updates

graph LR
    PR[Pull Request] --> Review[Review Changes]
    Review --> Test[Test Environment]
    Test --> Deploy[Deploy to Prod]
    Deploy --> Monitor[Monitor Status]

Application Updates

  1. Review release notes
  2. Test in staging if available
  3. Update flux manifests
  4. Monitor deployment
  5. Verify functionality

Backup Management

Backup Strategy

graph TD
    Apps[Applications] --> Data[Data Backup]
    Config[Configurations] --> Git[Git Repository]
    Secrets[Secrets] --> Vault[Secret Storage]

    Data --> Verify[Verification]
    Git --> Verify
    Vault --> Verify

Backup Verification

  • Regular restore testing
  • Data integrity checks
  • Recovery time objectives
  • Backup retention policy

Resource Management

Cleanup Procedures

  1. Remove unused resources

    • Orphaned PVCs
    • Completed jobs
    • Old backups
    • Unused configs
  2. Storage optimization

    • Compress old logs
    • Archive unused data
    • Clean container cache

Monitoring and Alerts

Key Metrics

  • Node health
  • Pod status
  • Resource usage
  • Storage capacity
  • Network performance

Alert Response

  1. Acknowledge alert
  2. Assess impact
  3. Investigate root cause
  4. Apply fix
  5. Document resolution

Security Maintenance

Regular Tasks

graph TD
    Audit[Security Audit] --> Review[Review Findings]
    Review --> Update[Update Policies]
    Update --> Test[Test Changes]
    Test --> Document[Document Changes]

Security Checklist

  • Review network policies
  • Check certificate expiration
  • Audit access controls
  • Review secret rotation
  • Scan for vulnerabilities
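
One quick way to spot-check certificate expiry from a workstation (the hostname is an example from this cluster; any ingress host works):

echo | openssl s_client -connect grafana.chelonianlabs.com:443 -servername grafana.chelonianlabs.com 2>/dev/null | openssl x509 -noout -dates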

Troubleshooting Guide

Common Issues

  1. Node Problems

    • Check node status
    • Review system logs
    • Verify resource usage
    • Check connectivity
  2. Storage Issues

    • Check Ceph cluster health
    • Verify CephFS status
    • Monitor storage capacity
    • Review OSD performance
    • Check MDS responsiveness
    • Verify PVC mount status
  3. Network Problems

    • Check DNS resolution
    • Verify network policies
    • Review ingress status
    • Test connectivity

Recovery Procedures

  1. Node Recovery
# Check node status
kubectl get nodes

# Drain node for maintenance (skip DaemonSet-managed pods)
kubectl drain node-name --ignore-daemonsets

# Perform maintenance
# ...

# Uncordon node
kubectl uncordon node-name
  2. Storage Recovery
# Check Ceph cluster health
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc -A

# Verify storage class
kubectl get sc

# Check CephFS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

Documentation

Maintenance Logs

  • Keep detailed records
  • Document changes
  • Track issues
  • Update procedures

Review Process

  1. Regular documentation review
  2. Update procedures
  3. Verify accuracy
  4. Add new sections

Best Practices

  1. Change Management

    • Use git workflow
    • Test changes
    • Document updates
    • Monitor results
  2. Resource Management

    • Regular cleanup
    • Optimize usage
    • Monitor trends
    • Plan capacity
  3. Security

    • Regular audits
    • Update policies
    • Monitor access
    • Review logs

Troubleshooting Guide

Diagnostic Workflow

graph TD
    Issue[Issue Detected] --> Triage[Triage]
    Triage --> Diagnose[Diagnose]
    Diagnose --> Fix[Apply Fix]
    Fix --> Verify[Verify]
    Verify --> Document[Document]

Common Issues

1. Cluster Health Issues

Node Problems

graph TD
    Node[Node Issue] --> Check[Check Status]
    Check --> |Healthy| Resources[Resource Issue]
    Check --> |Unhealthy| System[System Issue]
    Resources --> Memory[Memory]
    Resources --> CPU[CPU]
    Resources --> Disk[Disk]
    System --> Logs[Check Logs]
    System --> Network[Network]

Diagnosis Steps:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check system resources
kubectl top nodes
kubectl top pods --all-namespaces

# Check system logs
kubectl logs -n kube-system <pod-name>

2. Storage Issues

Volume Problems

graph LR
    PV[PV Issue] --> Status[Check Status]
    Status --> |Bound| Access[Access Issue]
    Status --> |Pending| Provision[Provisioning Issue]
    Status --> |Failed| Storage[Storage System]

Resolution Steps:

# Check PV/PVC status
kubectl get pv,pvc --all-namespaces

# Check storage class
kubectl get sc

# Check provisioner pods
kubectl get pods -n storage

3. Network Issues

Connectivity Problems

graph TD
    Net[Network Issue] --> DNS[DNS Check]
    Net --> Ingress[Ingress Check]
    Net --> Policy[Network Policy]

    DNS --> CoreDNS[CoreDNS Pods]
    Ingress --> Traefik[Traefik Logs]
    Policy --> Rules[Policy Rules]

Diagnostic Commands:

# Check DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check ingress
kubectl get ingress --all-namespaces
kubectl describe ingress <ingress-name> -n <namespace>

4. Application Issues

Pod Problems

graph TD
    Pod[Pod Issue] --> Status[Check Status]
    Status --> |Pending| Schedule[Scheduling]
    Status --> |CrashLoop| Crash[Container Crash]
    Status --> |Error| Logs[Check Logs]

Troubleshooting Steps:

# Check pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Flux Issues

GitOps Troubleshooting

graph TD
    Flux[Flux Issue] --> Source[Source Controller]
    Flux --> Kust[Kustomize Controller]
    Flux --> Helm[Helm Controller]

    Source --> Git[Git Repository]
    Kust --> Sync[Sync Status]
    Helm --> Release[Release Status]

Resolution Steps:

# Check Flux components
flux check

# Check sources
flux get sources git
flux get sources helm

# Check reconciliation
flux get kustomizations
flux get helmreleases

Performance Issues

Resource Constraints

graph LR
    Perf[Performance] --> CPU[CPU Usage]
    Perf --> Memory[Memory Usage]
    Perf --> IO[I/O Usage]

    CPU --> Limit[Resource Limits]
    Memory --> Constraint[Memory Constraints]
    IO --> Bottleneck[I/O Bottleneck]

Analysis Commands:

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Check resource quotas
kubectl get resourcequota -n <namespace>

Recovery Procedures

1. Node Recovery

  1. Drain node
  2. Perform maintenance
  3. Uncordon node
  4. Verify workloads

2. Storage Recovery

  1. Backup data
  2. Fix storage issues
  3. Restore data
  4. Verify access

3. Network Recovery

  1. Check connectivity
  2. Verify DNS
  3. Test ingress
  4. Update policies

Best Practices

1. Logging

  • Maintain detailed logs
  • Set appropriate retention
  • Use structured logging
  • Enable audit logging

2. Monitoring

  • Set up alerts
  • Monitor resources
  • Track metrics
  • Use dashboards

3. Documentation

  • Document issues
  • Record solutions
  • Update procedures
  • Share knowledge

Emergency Procedures

Critical Issues

  1. Assess impact
  2. Implement temporary fix
  3. Plan permanent solution
  4. Update documentation

Contact Information

  • Maintain escalation paths
  • Keep contact list updated
  • Document response times
  • Track incidents

Network Operations Runbook

Overview

This runbook provides step-by-step procedures for common network operations, troubleshooting, and emergency recovery scenarios for the Dapper Cluster network.

Quick Reference:


Table of Contents

  1. Common Operations
  2. Troubleshooting Procedures
  3. Emergency Procedures
  4. Switch Configuration
  5. Performance Monitoring
  6. Maintenance Windows

Common Operations

Accessing Network Equipment

SSH to Switches

Brocade ICX6610:

ssh admin@192.168.1.20
# [TODO: Document default credentials location]

Arista 7050:

ssh admin@192.168.1.21
# [TODO: Document default credentials location]

Aruba S2500-48p:

ssh admin@192.168.1.26
# [TODO: Document default credentials location]

Console Access

When SSH is unavailable:

# [TODO: Document console server or direct serial access]
# Brocade: Serial settings [TODO: baud rate, etc]
# Arista: Serial settings [TODO: baud rate, etc]

Checking Switch Health

Brocade ICX6610

# Basic health check
show version
show chassis
show cpu
show memory
show log tail 50

# Temperature and power
show inline power
show environment

# Check for errors
show logging | include error
show logging | include warn

Arista 7050

# Basic health check
show version
show environment all
show processes top

# Check for errors
show logging last 100
show logging | grep -i error

Verifying VLAN Configuration

Check VLAN Assignments

Brocade:

show vlan

# Check specific VLAN
show vlan 100
show vlan 150
show vlan 200

# Check which ports are in which VLANs
show vlan ethernet 1/1/1

Arista:

show vlan

# Check VLAN details
show vlan id 200

# Show interfaces by VLAN
show interfaces status

Verify Trunk Ports

Brocade:

# Show trunk configuration
show interface brief | include Trunk

# Show specific trunk
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Arista:

# Show trunk ports
show interface trunk

# Show specific interface
show interface ethernet 49
show interface ethernet 50

Brocade LAG Status

# Show all LAG groups
show lag brief

# Show specific LAG details
show lag [lag-id]

# Show which ports are in LAG
show lag | include active

# Check individual LAG port status
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Expected Output When Working:

LAG "brocade-to-arista" (lag-id [X]) has 2 active ports:
  ethernet 1/1/41 (40Gb) - Active
  ethernet 1/1/42 (40Gb) - Active

Arista Port-Channel Status

# Show port-channel summary
show port-channel summary

# Show specific port-channel
show interface port-channel 1

# Check member interfaces
show interface ethernet 49 port-channel
show interface ethernet 50 port-channel

Expected Output When Working:

Port-Channel1:
  Active Ports: 2
  Et49: Active
  Et50: Active
  Protocol: LACP

Monitoring Traffic and Bandwidth

Real-Time Interface Statistics

Brocade:

# Show interface rates
show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate

# Show all interface statistics
show interface ethernet 1/1/41

# Monitor in real-time (if supported)
monitor interface ethernet 1/1/41

Arista:

# Show interface counters
show interface ethernet 49 counters

# Show interface rates
show interface ethernet 49 | grep rate

# Real-time monitoring
watch 1 show interface ethernet 49 counters rate

Identify Top Talkers

Brocade:

# [TODO: Document method to identify top talkers]
# May require SNMP monitoring or sFlow

Arista:

# Check interface utilization
show interface counters utilization

# If sFlow configured:
# [TODO: Document sFlow commands]

Testing Connectivity

From Your Workstation

Test Management Plane:

# Ping all management interfaces
ping -c 4 192.168.1.20  # Brocade
ping -c 4 192.168.1.21  # Arista
ping -c 4 192.168.1.26  # Aruba
ping -c 4 192.168.1.7   # Mikrotik House
ping -c 4 192.168.1.8   # Mikrotik Shop

# Test wireless bridge latency
ping -c 100 192.168.1.8 | tail -3

Test Server Network (VLAN 100):

# Test Kubernetes nodes
ping -c 4 10.100.0.40   # K8s VIP
ping -c 4 10.100.0.50   # talos-control-1
ping -c 4 10.100.0.51   # talos-control-2
ping -c 4 10.100.0.52   # talos-control-3

Test from Kubernetes Nodes:

# SSH to a Talos node (if enabled) or use kubectl exec
kubectl exec -it -n default <pod-name> -- sh

# Test connectivity
ping 10.150.0.10   # Storage network
ping 10.100.0.1    # Gateway
ping 8.8.8.8       # Internet

MTU Testing (Jumbo Frames)

Test VLAN 150/200 MTU 9000:

# From a host on VLAN 150
ping -M do -s 8972 10.150.0.10

# -M do: Don't fragment
# -s 8972: 8972 + 28 (IP+ICMP headers) = 9000

# If this fails but smaller packets work, MTU is misconfigured

Path Testing

Trace route across networks:

# From your workstation
traceroute 10.100.0.50

# Expected path (if everything is working):
# 1. Local gateway
# 2. Wireless bridge
# 3. Brocade/OPNsense
# 4. Destination

Troubleshooting Procedures

Issue: No Connectivity to Garage Switches

Symptoms:

  • Cannot ping/SSH to Brocade (192.168.1.20) or Arista (192.168.1.21)
  • Can ping Aruba switch (192.168.1.26)

Diagnosis:

  1. Test wireless bridge:

    ping 192.168.1.7   # Mikrotik House
    ping 192.168.1.8   # Mikrotik Shop
    
    • If 192.168.1.7 responds but 192.168.1.8 doesn't: the wireless link is down
    • If neither responds: Mikrotik issue or configuration problem
  2. Check Aruba-to-Mikrotik connection:

    # SSH to Aruba
    ssh admin@192.168.1.26
    
    # Check port status for Mikrotik connection
    show interface [TODO: port ID]
    

Resolution:

If wireless bridge is down:

  1. Check Mikrotik radios web interface (192.168.1.7, 192.168.1.8)
  2. Check alignment and signal strength
  3. Verify power to both radios
  4. Check for interference (weather, obstacles)
  5. Emergency: Use physical console access to switches in garage

If Mikrotik is up but switches unreachable:

  1. Check VLAN 1 configuration on trunk ports
  2. Verify Mikrotik is not blocking traffic
  3. Check Brocade port connected to Mikrotik is up

Issue: Kubernetes Pods Can't Access Storage

Symptoms:

  • Pods stuck in ContainerCreating
  • PVC stuck in Pending
  • Errors about unable to mount CephFS

Diagnosis:

  1. Check Rook/Ceph health:

    kubectl -n rook-ceph get cephcluster
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
    
  2. Check network connectivity from Kubernetes nodes to Ceph monitors:

    # From a Talos node or debug pod
    ping 10.150.0.10   # Test VLAN 150 connectivity
    
    # Test Ceph monitor port
    nc -zv <monitor-ip> 6789
    
  3. Verify VLAN 150 MTU:

    # Test jumbo frames
    ping -M do -s 8972 10.150.0.10
    
  4. Check CSI driver logs:

    kubectl -n rook-ceph logs -l app=csi-cephfsplugin --tail=100
    

Resolution:

If MTU mismatch:

  1. Verify MTU 9000 on all VLAN 150 interfaces
  2. Check Proxmox bridge MTU settings
  3. Check switch port MTU configuration

If connectivity issue:

  1. Check VLAN 150 is properly tagged on trunk ports
  2. Verify Proxmox host network configuration
  3. Check Brocade routing for VLAN 150

Issue: Slow Ceph Performance

Symptoms:

  • Slow pod startup times
  • High I/O latency in applications
  • Ceph health warnings about slow ops

Diagnosis:

  1. Check Ceph cluster health:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
    
  2. Check network bandwidth utilization:

    On Brocade (VLAN 150 - Ceph Public):

    # Check 10Gb bonds to Proxmox hosts
    show interface ethernet 1/1/[TODO: ports] | include rate
    

    On Arista (VLAN 200 - Ceph Cluster):

    # Check 40Gb links to Proxmox hosts
    show interface ethernet [TODO: ports] counters rate
    
  3. Identify bottlenecks:

    • Are 10Gb links saturated? (VLAN 150)
    • Are 40Gb links saturated? (VLAN 200)
    • Is the Brocade-Arista link saturated?

Resolution:

If Brocade-Arista link is bottleneck:

  • Primary Issue: Only one 40Gb link active (see below to enable second link)
  • Enabling second 40Gb link will double bandwidth to 80Gbps

If MTU not configured:

  • Verify MTU 9000 on VLAN 150 and 200
  • Check each hop in the path

If switch CPU is high:

  • Check for broadcast storms
  • Verify STP is working correctly
  • Look for loops in topology

Issue: Network Loop / Broadcast Storm

Symptoms:

  • Network performance severely degraded
  • High CPU usage on switches
  • Connectivity flapping
  • Massive packet rates on interfaces

Diagnosis:

  1. Check for duplicate MAC addresses:

    # Brocade
    show mac-address
    
    # Look for same MAC on multiple ports
    
  2. Check STP status:

    # Brocade
    show spanning-tree
    
    # Arista
    show spanning-tree
    
  3. Look for physical loops:

    • Review physical topology diagram
    • Check for accidental double connections
    • Known issue: Brocade-Arista 2x 40Gb links not in LAG

Resolution:

Immediate (Emergency):

  1. Disable one link causing loop:

    # On Arista (already done in current config)
    configure
    interface ethernet 50
    shutdown
    
  2. Verify spanning-tree is enabled:

    # Brocade
    show spanning-tree
    
    # If not enabled:
    configure terminal
    spanning-tree
    

Permanent Fix:

  • Configure proper LAG/port-channel (see section below)

Issue: Proxmox Host Loses Network Connectivity

Symptoms:

  • Cannot ping Proxmox host management IP
  • VMs on host also offline
  • IPMI still accessible

Diagnosis:

  1. Access via IPMI console:

    # [TODO: Document IPMI access method]
    
  2. Check bond status on Proxmox:

    # From Proxmox console
    ip link show
    
    # Check bond interfaces
    cat /proc/net/bonding/bond0
    cat /proc/net/bonding/bond1
    
  3. Check switch ports:

    # On Brocade
    show interface ethernet 1/1/[TODO: ports for this host]
    show lag [TODO: lag-id for this host]
    

Resolution:

If bond is down on Proxmox:

  1. Check physical cables
  2. Restart networking on Proxmox (WARNING: will disrupt VMs)
  3. Check switch port status

If ports down on switch:

  1. Check for error counters
  2. Re-enable port if administratively down
  3. Check for physical issues (SFP, cable)

Issue: High Latency Across Wireless Bridge

Symptoms:

  • Ping times to garage > 10ms (normally 1-2ms)
  • Slow access to services in garage
  • Packet loss

Diagnosis:

  1. Test latency:

    ping -c 100 192.168.1.8
    
    # Look at:
    # - Average latency
    # - Packet loss %
    # - Jitter (variation)
    
  2. Check Mikrotik radio status:

    • Access web interface: 192.168.1.7 and 192.168.1.8
    • Check signal strength
    • Check throughput/bandwidth utilization
    • Look for interference
  3. Test with iperf:

    # On server side (garage)
    iperf3 -s
    
    # On client side (house)
    iperf3 -c 192.168.1.8 -t 30
    
    # Should see ~1 Gbps
    

Resolution:

If signal degraded:

  1. Check for obstructions (trees, weather)
  2. Check alignment
  3. Check for interference sources
  4. Consider backup link or failover

If bandwidth saturated:

  1. Identify high-bandwidth users/applications
  2. Implement QoS if available
  3. Consider upgrade to higher bandwidth link

Emergency Procedures

Complete Network Outage (Wireless Bridge Down)

Impact:

  • No remote access to garage infrastructure
  • Kubernetes cluster still functions internally
  • No internet access from garage
  • Management access requires physical presence

Emergency Access Methods:

  1. Physical console access:

    # [TODO: Document where console cables are stored]
    # Connect laptop directly to switch console port
    
  2. IPMI access (if VPN or alternative route exists):

    # [TODO: Document IPMI network topology]
    

Restoration Steps:

  1. Check Mikrotik radios:

    • Physical inspection of both radios
    • Power cycle if needed
    • Check alignment
  2. Temporary workaround:

    • [TODO: Document backup connectivity method]
    • VPN tunnel over alternative route?
    • Temporary cable run?
  3. Verify restoration:

    ping 192.168.1.8
    ping 192.168.1.20
    ssh admin@192.168.1.20
    

Core Switch (Brocade) Failure

Impact:

  • Loss of VLAN 150/200 routing
  • Kubernetes cluster degraded (storage issues)
  • Loss of 10Gb connectivity to Proxmox hosts

Emergency Actions:

  1. Do NOT reboot all Proxmox hosts simultaneously

    • The cluster may still be serving workloads that are already running
    • Storage connections via VLAN 200 through Arista may still work
  2. Check Brocade status:

    • Physical inspection (power, fans, LEDs)
    • Console access
    • Review logs
  3. If Brocade must be replaced:

    • [TODO: Document backup configuration location]
    • [TODO: Document restoration procedure]
    • [TODO: Document spare hardware location]

Spanning Tree Failure / Network Loop

Impact:

  • Network completely unusable
  • High CPU on all switches
  • Broadcast storm

Emergency Actions:

  1. Disconnect Brocade-Arista links:

    # On Arista (fastest access if SSH still works)
    configure
    interface ethernet 49
    shutdown
    interface ethernet 50
    shutdown
    
  2. Or physically disconnect:

    • Unplug both 40Gb QSFP+ cables between Brocade and Arista
  3. Wait for network to stabilize (30-60 seconds)

  4. Reconnect ONE link only:

    # On Arista
    configure
    interface ethernet 49
    no shutdown
    
  5. Verify stability before enabling second link

Accidental Configuration Change

Symptoms:

  • Network suddenly degraded after change
  • New errors appearing
  • Connectivity loss

Emergency Actions:

  1. Rollback configuration:

    Brocade:

    # Show configuration history
    show configuration
    
    # Revert to previous config
    # [TODO: Document Brocade config rollback method]
    

    Arista:

    # Show rollback options
    show configuration sessions
    
    # Rollback to previous
    configure session rollback <session-name>
    
  2. If rollback not available:

    • Reboot switch (loads startup-config)
    • WARNING: Brief outage during reboot

Switch Configuration

Configure Brocade-Arista LAG (Fix Loop Issue)

Prerequisites:

  • Maintenance window scheduled
  • Both 40Gb QSFP+ cables connected and working
  • Console access to both switches available
  • Configuration backed up

Step 1: Pre-Change Verification

# Verify current state
# On Brocade:
show interface ethernet 1/1/41
show interface ethernet 1/1/42

# On Arista:
show interface ethernet 49
show interface ethernet 50  # Currently disabled

# Document current traffic levels
show interface ethernet 1/1/41 | include rate

Step 2: Configure Brocade LAG

# SSH to Brocade
ssh admin@192.168.1.20

# Enter configuration mode
enable
configure terminal

# Create LAG
lag brocade-to-arista dynamic id [TODO: Choose available LAG ID, e.g., 10]
  ports ethernet 1/1/41 to 1/1/42
  primary-port 1/1/41
  lacp-timeout short
  deploy
exit

# Configure VLAN on LAG
vlan 1
  tagged lag [LAG-ID]
exit

vlan 100
  tagged lag [LAG-ID]
exit

vlan 150
  tagged lag [LAG-ID]
exit

vlan 200
  tagged lag [LAG-ID]
exit

# Apply to interfaces
interface ethernet 1/1/41
  link-aggregate active
exit

interface ethernet 1/1/42
  link-aggregate active
exit

# Save configuration
write memory

# Verify
show lag brief
show lag [LAG-ID]

Step 3: Configure Arista Port-Channel

# SSH to Arista
ssh admin@192.168.1.21

# Enter configuration mode
enable
configure

# Create port-channel
interface Port-Channel1
  description Link to Brocade ICX6610
  switchport mode trunk
  switchport trunk allowed vlan 1,100,150,200
exit

# Add member interfaces
interface Ethernet49
  description Brocade 40G Link 1
  channel-group 1 mode active
  lacp rate fast
exit

interface Ethernet50
  description Brocade 40G Link 2
  channel-group 1 mode active
  lacp rate fast
exit

# Save configuration
write memory

# Verify
show port-channel summary
show interface Port-Channel1
show lacp neighbor

Step 4: Verify Configuration

# On Brocade:
show lag [LAG-ID]
# Should show: 2 ports active

show lacp
# Should show: Negotiated with neighbor

# On Arista:
show port-channel summary
# Should show: Po1(U) with Et49(P), Et50(P)

show lacp neighbor
# Should show: Brocade as partner

# Test traffic balancing
show interface Port-Channel1 counters
show interface ethernet 49 counters
show interface ethernet 50 counters
# Both Et49 and Et50 should show traffic

Step 5: Monitor for Issues

# Watch for 15 minutes
# On Arista:
watch 10 show port-channel summary

# Check for errors
show logging | grep -i Port-Channel1

# Monitor CPU (should be normal)
show processes top

Rollback Plan (if issues occur):

# On Arista (fastest to disable)
configure
interface ethernet 50
shutdown

# On Brocade (if needed)
configure terminal
no lag brocade-to-arista
interface ethernet 1/1/41
  no link-aggregate
interface ethernet 1/1/42
  no link-aggregate

Adding a New VLAN

Example: Adding VLAN 300 for IoT devices

Step 1: Plan VLAN

  • VLAN ID: 300
  • Network: [TODO: e.g., 10.30.0.0/24]
  • Gateway: [TODO: Which device?]
  • Required on: [TODO: Which switches/trunks?]

Step 2: Create VLAN on Brocade

ssh admin@192.168.1.20
enable
configure terminal

# Create VLAN
vlan 300
  name IoT-Network
  tagged ethernet 1/1/41 to 1/1/42  # Trunk to Arista
  tagged ethernet 1/1/[TODO]         # Trunk to Mikrotik
  untagged ethernet 1/1/[TODO]       # Access ports if needed
exit

# If Brocade is gateway:
interface ve 300
  ip address [TODO: IP]/24
exit

# Save
write memory

Step 3: Add to other switches as needed

# On Arista:
configure
vlan 300
  name IoT-Network
exit

interface Port-Channel1
  switchport trunk allowed vlan add 300
exit

write memory

Configuring Jumbo Frames (MTU 9000)

For VLAN 150 and 200 (Ceph networks)

On Brocade:

# Configure MTU on VLAN interfaces
interface ve 150
  mtu 9000
exit

interface ve 200
  mtu 9000
exit

# Configure MTU on physical/LAG interfaces
interface ethernet 1/1/[TODO: storage network ports]
  mtu 9000
exit

write memory

On Arista:

# Configure MTU on interfaces carrying VLAN 150/200
interface ethernet [TODO: ports]
  mtu 9216  # 9000 + overhead
exit

write memory

Verify MTU:

# From Talos node
ping -M do -s 8972 10.150.0.10
# Should succeed without fragmentation

Performance Monitoring

Key Metrics to Monitor

Switch Health:

  • CPU utilization (should be <30% normally)
  • Memory utilization (should be <70%)
  • Temperature (within operating range)
  • Power supply status

Interface Health:

  • Error counters (input/output errors)
  • CRC errors
  • Interface resets
  • Utilization percentage

Traffic Patterns:

  • Bandwidth utilization per interface
  • Top talkers per VLAN
  • Broadcast/multicast rates

Setting Up Monitoring

[TODO: Document monitoring setup]

Options:

  1. SNMP monitoring to Prometheus
  2. sFlow for traffic analysis
  3. Switch logging to Loki
  4. Grafana dashboards

Example Prometheus Targets:

# [TODO: Example prometheus config for SNMP exporter]

Baseline Performance Metrics

Normal Operating Conditions:

| Metric | Expected Value | Alert Threshold |
| --- | --- | --- |
| Wireless Bridge Latency | 1-2ms | > 5ms |
| Wireless Bridge Loss | 0% | > 1% |
| Brocade CPU | < 20% | > 60% |
| Arista CPU | < 15% | > 50% |
| 40Gb Link Utilization | < 50% | > 80% |
| 10Gb Link Utilization | < 60% | > 85% |

[TODO: Add baseline measurements]


Maintenance Windows

Pre-Maintenance Checklist

Before any network maintenance:

  • Schedule maintenance window
  • Notify all users
  • Back up switch configurations
    # Brocade
    show running-config > backup-$(date +%Y%m%d).cfg
    
    # Arista
    show running-config > backup-$(date +%Y%m%d).cfg
    
  • Document current state
  • Have rollback plan ready
  • Ensure console access available
  • Test backup connectivity method

Post-Maintenance Checklist

After any network maintenance:

  • Verify all links are up
    show interface brief
    show lag brief  # Brocade
    show port-channel summary  # Arista
    
  • Check for errors
    show logging | include error
    
  • Test connectivity to all VLANs
  • Monitor for 30 minutes for issues
  • Update documentation with any changes
  • Save configurations
    write memory
    

Regular Maintenance Tasks

Weekly:

  • Review switch logs for errors/warnings
  • Check interface error counters
  • Verify wireless bridge performance

Monthly:

  • Review bandwidth utilization trends
  • Check for firmware updates
  • Verify backup configurations are current

Quarterly:

  • Review and update network documentation
  • Test emergency procedures
  • Review and optimize switch configurations

Configuration Backup

Backing Up Switch Configurations

Brocade ICX6610:

# Method 1: Copy to TFTP server
copy running-config tftp [TODO: TFTP server IP] brocade-backup-$(date +%Y%m%d).cfg

# Method 2: Display and save manually
show running-config > /tmp/brocade-config.txt

# [TODO: Document automated backup method]

Arista 7050:

# Show running config
show running-config

# Copy to USB (if available)
copy running-config usb:/arista-backup-$(date +%Y%m%d).cfg

# [TODO: Document automated backup method]

Storage Location:

  • [TODO: Document where configurations are backed up]
  • Consider: Git repository for version control
  • Consider: Automated daily backups via Ansible

Restoring Configurations

Brocade:

# Load config from file
copy tftp running-config [TFTP-IP] [filename]

# Or manually paste config
configure terminal
# Paste configuration

Arista:

# Copy config from file
copy usb:/backup.cfg running-config

# Or configure manually
configure
# Paste configuration

Security Considerations

Access Control

[TODO: Document security policies]

  • Who has access to switch management?
  • How are credentials managed?
  • Is 2FA available/configured?
  • Are management VLANs isolated?

Security Best Practices

  1. Change default passwords
  2. Disable unused ports
  3. Enable port security where appropriate
  4. Configure DHCP snooping
  5. Enable storm control
  6. Regular firmware updates
  7. Monitor for unauthorized devices

Useful Commands Reference

Brocade ICX6610 Quick Reference

# Basic show commands
show version
show running-config
show interface brief
show vlan
show lag
show mac-address
show spanning-tree
show log

# Interface management
interface ethernet 1/1/1
  enable
  disable
  description [text]

# Save configuration
write memory

Arista 7050 Quick Reference

# Basic show commands
show version
show running-config
show interfaces status
show vlan
show port-channel summary
show mac address-table
show spanning-tree
show logging

# Interface management
configure
interface ethernet 1
  shutdown
  no shutdown
  description [text]

# Save configuration
write memory

Contacts and Escalation

[TODO: Fill in contact information]

| Role | Name | Contact | Escalation Level |
| --- | --- | --- | --- |
| Primary Network Admin | [TODO] | [TODO] | 1 |
| Secondary Contact | [TODO] | [TODO] | 2 |
| Vendor Support - Brocade | [TODO] | [TODO] | 3 |
| Vendor Support - Arista | [TODO] | [TODO] | 3 |

Change Log

| Date | Change | Person | Impact | Notes |
| --- | --- | --- | --- | --- |
| 2025-10-14 | Initial runbook created | Claude | None | Baseline documentation |
| [TODO] | [TODO] | [TODO] | [TODO] | [TODO] |

References

40GB Ceph Storage Network Configuration

Project Overview

This project configures the 40GB network infrastructure for Ceph cluster storage (VLAN 200), providing dedicated high-bandwidth links for Ceph OSD replication traffic. The configuration enables 40Gbps connectivity for three Proxmox hosts with routing support for the fourth host through a Brocade-Arista link aggregation.

Key Benefits:

  • 🚀 4x bandwidth increase for hosts with 40Gb links (10Gb → 40Gb)
  • 🔗 80Gbps aggregated bandwidth between Brocade and Arista switches
  • 📦 Jumbo frame support (MTU 9000) for improved Ceph performance
  • 🔀 No bottlenecks for mixed-speed traffic (10Gb and 40Gb hosts)
  • 🛡️ Resilient design with LACP link aggregation and failover

Project Status: ✅ Ready for Implementation


Architecture Summary

Current State (Before Configuration)

VLAN 200 Traffic:

  • All Proxmox hosts using 2x 10Gb bonds to Brocade
  • Shared bandwidth with VLAN 100, 150
  • MTU 1500 (no jumbo frames)
  • Brocade-Arista: Only 1x 40Gb link active (2nd disabled due to loop)

Limitations:

  • Ceph replication limited to ~9-10 Gbps per host
  • Contention with other VLAN traffic
  • No jumbo frame support

Target State (After Configuration)

VLAN 200 Traffic:

  • Proxmox-01, 02, 04: Dedicated 40Gb links to Arista
  • Proxmox-03: 10Gb bond to Brocade (no 40Gb link available)
  • Brocade-Arista: 2x 40Gb LAG (80Gbps total)
  • MTU 9000 throughout the path
  • Layer 3 routing on Brocade for Proxmox-03

Benefits:

  • Proxmox-01, 02, 04: 40 Gbps dedicated Ceph bandwidth
  • Proxmox-03: 10 Gbps Ceph bandwidth (no bottleneck at switches)
  • Direct switching for 40Gb hosts (no routing latency)
  • Optimized for large Ceph object transfers (jumbo frames)

Network Topology

Physical Connections

┌─────────────┐
│ Proxmox-01  │───────40Gb───────┐
│ (10.200.0.1)│                  │
└─────────────┘                  │
                                 │
┌─────────────┐                  │
│ Proxmox-02  │───────40Gb───────┤
│ (10.200.0.2)│                  │
└─────────────┘                  │    ┌──────────────┐
                                 ├────│ Arista 7050  │
┌─────────────┐                  │    │              │
│ Proxmox-04  │───────40Gb───────┤    │ Et27: Px-01  │
│ (10.200.0.4)│                  │    │ Et28: Px-02  │
└─────────────┘                  │    │ Et29: Px-04  │
                                 │    └──────┬───────┘
                                 │           │
┌─────────────┐                  │      2x 40Gb LAG
│ Proxmox-03  │─┐                │    (Port-Channel1)
│ (10.200.0.3)│ │                │           │
└─────────────┘ │                │    ┌──────┴───────┐
                │                │    │ Brocade 6610 │
          2x 10Gb bond           │    │              │
          (LAG to Brocade)       │    │ LAG 11: 80Gb │
                │                │    │ VE200: GW    │
                └────────────────┘    └──────────────┘
                                      (10.200.0.254)

Traffic Paths

Direct 40Gb Paths (no routing):

  • Proxmox-01 ↔ Proxmox-02: Via Arista (switched)
  • Proxmox-01 ↔ Proxmox-04: Via Arista (switched)
  • Proxmox-02 ↔ Proxmox-04: Via Arista (switched)

Routed Paths (through Brocade):

  • Proxmox-03 ↔ Proxmox-01/02/04:
    • Px-03 → 10Gb bond → Brocade VE200 (Layer 3 routing)
    • Brocade → 80Gb LAG → Arista
    • Arista → 40Gb → Px-01/02/04
    • Bandwidth: Limited to 10Gb at Px-03, but LAG prevents bottleneck

Configuration Phases

This project is divided into 6 phases that must be completed in order:

Phase 1: Configure Brocade-Arista 40GB LAG ⚙️

Documentation

Enable the second 40Gb link between Brocade and Arista by configuring LACP link aggregation.

Key Tasks:

  • Create LAG on Brocade (LAG ID 11)
  • Create Port-Channel on Arista (Port-Channel1)
  • Enable both 40Gb links (Et25, Et26 on Arista)
  • Configure LACP with fast timeout

Outcome: 80Gbps bandwidth between switches, eliminating potential bottleneck.


Phase 2: Configure VLAN 200 Routing on Brocade 🌐

Documentation

Configure Layer 3 routing on Brocade for Proxmox-03 to reach other hosts via VLAN 200.

Key Tasks:

  • Create VLAN interface VE200 on Brocade (10.200.0.254/24)
  • Set MTU 9000 on VE200
  • Configure default route on Proxmox-03 pointing to Brocade

Outcome: Proxmox-03 can route to all other hosts on VLAN 200 through Brocade.


Phase 3: Configure Arista VLAN 200 with Jumbo Frames 📦

Documentation

Enable jumbo frame support (MTU 9216) on all Arista interfaces carrying VLAN 200 traffic.

Key Tasks:

  • Set MTU 9216 on Port-Channel1 (Brocade link)
  • Set MTU 9216 on Et27, Et28, Et29 (Proxmox links)
  • Verify VLAN 200 trunk configuration

Outcome: Arista switch supports jumbo frames for optimal Ceph performance.


Phase 4: Identify 40GB NICs on Proxmox Hosts 🔍

Documentation

Identify which physical network interfaces are the 40Gb NICs on each Proxmox host and map them to Arista switch ports.

Key Tasks:

  • Use ethtool/LLDP/link flap to identify 40Gb interfaces
  • Document interface names (e.g., enp2s0, enp7s0)
  • Map Proxmox hosts to Arista ports (Et27, Et28, Et29)
  • Record MAC addresses for tracking

Outcome: Complete mapping of Proxmox hosts to Arista ports with interface names documented.


Phase 5: Reconfigure Proxmox Hosts 🔧

Documentation

Reconfigure Proxmox hosts to move VLAN 200 from 10Gb bonds to dedicated 40Gb interfaces.

Key Tasks:

  • Edit /etc/network/interfaces on each host
  • Move vmbr200 from bond1 to 40Gb interface (with VLAN 200 tag)
  • Set MTU 9000 on all interfaces
  • Test connectivity after each host (one at a time!)

Outcome: Proxmox-01, 02, 04 using 40Gb links; Proxmox-03 using 10Gb bond with routing.

⚠️ Important: Configure hosts one at a time to minimize Ceph disruption!
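
For orientation, a minimal /etc/network/interfaces sketch for one 40Gb host, using Proxmox-02's values from the reference table later in this document (interface names must be confirmed against Phase 4 findings, and whether the address lives on the bridge or the VLAN sub-interface depends on the existing host layout):

# Illustrative only - adapt to the interfaces identified in Phase 4
auto enp7s0
iface enp7s0 inet manual
    mtu 9000

auto enp7s0.200
iface enp7s0.200 inet manual
    mtu 9000

auto vmbr200
iface vmbr200 inet static
    address 10.200.0.2/24
    bridge-ports enp7s0.200
    bridge-stp off
    bridge-fd 0
    mtu 9000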


Phase 6: Testing and Verification ✅

Documentation

Comprehensive testing to validate the configuration and measure performance improvements.

Test Categories:

  1. Connectivity Tests: Ping between all hosts, ARP resolution
  2. MTU Tests: Jumbo frame validation (ping -M do -s 8972)
  3. Bandwidth Tests: iperf3 throughput measurements
  4. Ceph Performance: OSD network tests, rebalance speed
  5. Switch Verification: LAG status, traffic distribution, error counters
  6. Failover Tests: LAG member failure, host failure scenarios

Outcome: Validated 4x performance improvement with comprehensive test results.


Quick Start Guide

Prerequisites

Before starting, ensure you have:

  • SSH access to all switches (Brocade, Arista)
  • SSH/IPMI access to all Proxmox hosts
  • Backup of all current configurations
  • Maintenance window scheduled (recommended: 2-4 hours)
  • Console access ready (IPMI) in case of network issues
  • This documentation downloaded and available offline

Execution Order

Follow phases in strict order:

  1. Phase 1 (30 min): Configure switch LAG - minimal disruption
  2. Phase 2 (15 min): Add Brocade routing - no disruption to existing hosts
  3. Phase 3 (15 min): Set MTU on Arista - minimal disruption
  4. Phase 4 (30 min): Identify interfaces - read-only, no disruption
  5. Phase 5 (60 min): Reconfigure hosts - brief Ceph disruption per host
  6. Phase 6 (60+ min): Testing - monitor for issues

Total time: 3-4 hours (including testing)

Rollback Strategy

Each phase includes a rollback procedure. Key rollback points:

  • Phase 1: Disable 2nd 40Gb link on Arista (immediate)
  • Phase 2: Disable VE200 on Brocade (immediate)
  • Phase 3: Revert MTU on Arista (immediate)
  • Phase 5: Restore /etc/network/interfaces backup on each host

Critical: Keep one SSH session open to each device before making changes!
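
For Phase 5, the host-side rollback is a file restore (a sketch assuming Proxmox's ifupdown2; adjust the backup filename to the actual date):

# On the affected Proxmox host
cp /etc/network/interfaces.backup-YYYYMMDD /etc/network/interfaces
ifreload -a    # or reboot the host if ifreload is unavailable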


Configuration Reference

IP Addressing (VLAN 200)

| Host | Management IP | VLAN 200 IP | Interface | Link Speed | Connected To |
| --- | --- | --- | --- | --- | --- |
| Proxmox-01 | 192.168.1.62 | 10.200.0.1/24 | enp2s0.200 | 40Gb | Arista Et27 |
| Proxmox-02 | 192.168.1.63 | 10.200.0.2/24 | enp7s0.200 | 40Gb | Arista Et28 |
| Proxmox-03 | 192.168.1.64 | 10.200.0.3/24 | bond1.200 | 2x 10Gb | Brocade LAG 2 |
| Proxmox-04 | 192.168.1.66 | 10.200.0.4/24 | enp7s0.200 | 40Gb | Arista Et29 |
| Brocade | 192.168.1.20 | 10.200.0.254/24 | VE200 | - | Gateway |

Note: Interface names may vary - update based on Phase 4 findings.

Switch Configuration Summary

Brocade ICX6610:

  • LAG 11: "brocade-to-arista" (ports 1/2/1, 1/2/6)
    • LACP dynamic, short timeout
    • VLANs 1, 20, 100, 150, 200 tagged
    • MTU 9000
  • VLAN 200:
    • VE200: 10.200.0.254/24, MTU 9000
    • Routing enabled

Arista 7050:

  • Port-Channel1: (Et25, Et26)
    • LACP active, fast rate
    • Trunk: VLANs 1, 20, 100, 150, 200
    • MTU 9216
  • Et27, Et28, Et29:
    • Trunk mode
    • VLAN 200 tagged
    • MTU 9216

MTU Configuration

| Device | Interface | MTU | Notes |
| --- | --- | --- | --- |
| Brocade | VE200 | 9000 | VLAN 200 gateway |
| Brocade | LAG 11 | 9000 | To Arista |
| Arista | Port-Channel1 | 9216 | From Brocade |
| Arista | Et27-29 | 9216 | To Proxmox |
| Proxmox | 40Gb NIC | 9000 | Physical interface |
| Proxmox | VLAN interface | 9000 | enp*.200 |
| Proxmox | vmbr200 | 9000 | Bridge |

Why 9216 on Arista? The Arista interface MTU must cover the 9000-byte payload plus Layer 2 framing overhead (Ethernet and VLAN headers), so it is set to 9216 to give jumbo frames from the hosts headroom.


Expected Performance

Bandwidth Improvements

| Connection | Before (10Gb) | After (40Gb) | Improvement |
| --- | --- | --- | --- |
| Px-01 ↔ Px-02 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-01 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-02 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-03 ↔ Others | ~9 Gbps | ~9-10 Gbps | 1.0x* |

* Proxmox-03 limited by 10Gb uplink, but no switch bottleneck due to 80Gb LAG.

Latency Improvements

| Path | Expected RTT | Notes |
| --- | --- | --- |
| 40Gb direct | < 0.5 ms | Switched only (no routing) |
| Via Brocade | < 1.5 ms | One Layer 3 hop |
| Before (10Gb shared) | 1-2 ms | Shared bandwidth, potential queuing |

Ceph Performance

Expected improvements:

  • Rebalance speed: 4x faster on hosts with 40Gb links
  • OSD recovery: Significantly reduced time for large objects
  • Client I/O: Reduced latency for Kubernetes pods using RBD/CephFS
  • Concurrent operations: Better performance under load

Troubleshooting

Common Issues

Issue: LAG/Port-Channel won't form

  • Check: LACP mode (both sides "active"), VLAN configuration matches
  • Verify: Physical links are up, no cable/SFP issues
  • Ref: Phase 1 Troubleshooting

Issue: Jumbo frames don't work

  • Check: MTU on every hop (host → switch → switch → host)
  • Test: ping -M do -s 8972 <destination>
  • Ref: Phase 3 Troubleshooting

Issue: Proxmox-03 can't reach other hosts

  • Check: VE200 is up, routing table on Px-03, gateway configured
  • Verify: LAG 11 is up between Brocade and Arista
  • Ref: Phase 2 Troubleshooting

Issue: Lower than expected bandwidth

  • Check: CPU usage during iperf3, network card offloading settings
  • Test: Multi-stream iperf3 (-P 10)
  • Ref: Phase 6 Troubleshooting

Emergency Contacts

| Role | Responsibility | Contact |
| --- | --- | --- |
| Network Admin | Switch configuration | [FILL IN] |
| System Admin | Proxmox hosts | [FILL IN] |
| Storage Admin | Ceph cluster | [FILL IN] |

Maintenance and Operations

Regular Checks

Weekly:

  • Monitor switch logs for errors
  • Check LAG/Port-Channel status
  • Review interface error counters
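
Example commands for the weekly checks (exact syntax varies slightly by firmware version):

# Brocade ICX6610
show lag
show interfaces brief

# Arista 7050
show port-channel summary
show interfaces counters errors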

Monthly:

  • Verify bandwidth utilization trends
  • Test failover procedures
  • Review and update documentation

Quarterly:

  • Full connectivity and performance test
  • Review configurations for optimization
  • Plan for firmware updates

Backup Procedures

Switch Configurations:

# Capture switch configs over SSH from a workstation (the switch CLIs cannot
# redirect to a dated local file themselves; adjust usernames/hostnames)
ssh admin@brocade "show running-config" > brocade-backup-$(date +%Y%m%d).txt
ssh admin@arista "show running-config" > arista-backup-$(date +%Y%m%d).txt

Proxmox Network Configs:

# On each host
cp /etc/network/interfaces /etc/network/interfaces.backup-$(date +%Y%m%d)

Store backups in: /home/derek/projects/dapper-cluster/docs/src/operations/network-configs/backups/

Future Enhancements

Potential Improvements:

  • Add 40Gb link to Proxmox-03 (hardware upgrade)
  • Implement network monitoring with Prometheus SNMP exporter
  • Configure sFlow for traffic analysis
  • Set up automated configuration backups (Ansible)
  • Add redundant Brocade-Arista links (if needed)
  • Upgrade to 100Gb links (future-proofing)

Success Metrics

Technical Metrics

  • Bandwidth: 40Gb links achieve 35+ Gbps throughput
  • Latency: < 1ms RTT between hosts on Arista
  • MTU: 9000 byte frames work end-to-end
  • Availability: 99.9%+ uptime (LAG provides redundancy)
  • Ceph: Rebalance speed increased by 4x

Operational Metrics

  • Deployment Time: < 4 hours total
  • Downtime: < 5 minutes per host (during reconfiguration)
  • Documentation: Complete and accurate
  • Rollback: < 2 minutes to revert changes
  • Team Readiness: All staff trained on new configuration

References

Internal Documentation

Phase Documentation

Vendor Documentation

  • Brocade ICX6610 Command Reference
  • Arista 7050 Configuration Guide
  • Proxmox Network Configuration
  • Ceph Network Recommendations

Project History

Date         Milestone                            Status
2025-10-14   Project planning and documentation   ✅ Complete
TBD          Phase 1: Brocade-Arista LAG          📋 Ready
TBD          Phase 2: Brocade routing             📋 Ready
TBD          Phase 3: Arista MTU                  📋 Ready
TBD          Phase 4: Interface identification    📋 Ready
TBD          Phase 5: Proxmox reconfiguration     📋 Ready
TBD          Phase 6: Testing                     📋 Ready
TBD          Project completion                   ⏳ Pending

License and Credits

Documentation created by: Claude (Anthropic AI Assistant)
Date: October 14, 2025
Version: 1.0
Project: Dapper Cluster 40Gb Storage Network Upgrade

Contributors:

  • Network architecture review and validation
  • Configuration procedures and best practices
  • Testing methodology and verification procedures

Ready to begin? Start with Phase 1: Configure Brocade-Arista LAG

Monitoring Stack Gap Analysis - October 17, 2025

Executive Summary

A comprehensive review of the Grafana, Prometheus, and Loki monitoring stack found the core components functional, with 97.6% of Prometheus scrape targets healthy. The critical issues identified require both Kubernetes configuration changes and remediation of the external Ceph infrastructure.


Component Status

✅ Grafana (Healthy)

  • Status: Running (2/2 containers)
  • Memory: 441Mi
  • URL: grafana.chelonianlabs.com
  • Datasources: Properly configured
    • Prometheus: http://prometheus-operated.observability.svc.cluster.local:9090
    • Loki: http://loki-headless.observability.svc.cluster.local:3100
    • Alertmanager: http://alertmanager-operated.observability.svc.cluster.local:9093
  • Dashboards: 35+ configured and loading
  • Issues: None

✅ Prometheus (Healthy with Minor Issues)

  • Status: Running HA mode (2 replicas)
  • Memory: 2.1GB per pod
  • Scrape Success: 161/165 targets healthy (97.6%)
  • Storage: 5.8GB/100GB used (6%)
  • Retention: 14 days
  • Monitoring Coverage:
    • 38 ServiceMonitors
    • 7 PodMonitors
    • 44 PrometheusRules
  • Issues:
    • 4 targets down (2.4% failure rate)
    • Duplicate timestamp warnings from kube-state-metrics

⚠️ Loki (Functional but Dropping Logs)

  • Status: Running (2/2 containers)
  • Memory: 340Mi
  • Storage: 1.6GB/30GB used (5%)
  • Retention: 14 days
  • Log Collection: Successfully collecting from 17 namespaces
  • Issues:
    • CRITICAL: Max entry size limit (256KB) exceeded
    • Plex logs (553KB entries) being rejected
    • Error: Max entry size '262144' bytes exceeded

✅ Promtail (Healthy)

  • Status: DaemonSet running on all 11 nodes
  • Memory: 70-140Mi per pod
  • Target: http://loki-headless.observability.svc.cluster.local:3100/loki/api/v1/push
  • Issues: None (successfully shipping logs despite Loki rejections)

⚠️ Alertmanager (Healthy but Active Alerts)

  • Status: Running (2/2 containers)
  • Memory: 37Mi
  • Active Alerts: 19 alerts firing
  • Issues: See Active Alerts section below

Critical Issues

1. Loki Log Entry Size Limit

Severity: High
Impact: Logs from high-volume applications being dropped

Details:

  • Default max entry size: 262,144 bytes (256KB)
  • Plex application producing 553KB log entries
  • Logs silently dropped without alerting

Fix Applied:

  • ✅ Updated /kubernetes/apps/observability/loki/app/helmrelease.yaml
  • Added limits_config.max_line_size: 1048576 (1MB)
  • Action Required: Commit and push to trigger Flux reconciliation
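
For reference, the values change looks roughly like the following (a sketch only; the exact nesting depends on the Loki chart version in use):

# kubernetes/apps/observability/loki/app/helmrelease.yaml (sketch)
loki:
  limits_config:
    max_line_size: 1048576   # 1MB, up from the 256KB default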

Verification:

# After deployment, verify no more errors:
kubectl logs -n observability -l app.kubernetes.io/name=promtail --tail=100 | grep "exceeded"

2. External Ceph Cluster Health Warnings

Severity: High
Impact: PVC provisioning failures, pod scheduling blocked

Details: External Ceph cluster (running on Proxmox hosts) showing HEALTH_WARN:

  1. PG_AVAILABILITY (Critical):

    • 128 placement groups inactive
    • 128 placement groups incomplete
    • This is blocking new PVC creation
  2. MDS_SLOW_METADATA_IO:

    • 1 MDS (metadata server) reporting slow I/O
    • Impacts CephFS performance
  3. MDS_TRIM:

    • 1 MDS behind on trimming
    • Can impact metadata operations

Ceph Cluster Info:

  • FSID: 782dd297-215e-4c35-b7cf-659c20e6909e
  • Version: 18.2.7-0 (Reef)
  • Monitors: proxmox-02 (10.150.0.2), proxmox-03 (10.150.0.3), proxmox-04 (10.150.0.4)
  • Capacity: 195TB of 244TB available (~80% free)

Action Required: These are infrastructure-level issues that must be resolved on the Proxmox/Ceph cluster directly:

# SSH to Proxmox host and run:
ceph health detail
ceph pg dump | grep -E "inactive|incomplete"
ceph osd tree
ceph fs status cephfs

# Likely fixes (depending on root cause):
# - Check OSD status and bring up any down OSDs
# - Verify network connectivity between OSDs
# - Check disk space on OSD nodes
# - Review Ceph logs for specific PG issues

Kubernetes Impact:

  • ❌ Gatus pod stuck in Pending (PVC provisioning failure)
  • ❌ VolSync destination pods failing
  • ❌ Any new workloads requiring CephFS storage blocked

Prometheus Scrape Target Failures

Down Targets (4 total):

  1. athena.manor:9221 - Unnamed exporter (likely SNMP)
  2. circe.manor:9221 - Unnamed exporter (likely SNMP)
  3. nut-upsd.kube-system.svc.cluster.local:3493 - NUT UPS exporter
  4. zigbee-controller-garage.manor - Zigbee controller

Analysis: All down targets are edge devices or external services. Core Kubernetes monitoring intact.

Recommended Actions:

  • Verify network connectivity to .manor hostnames
  • Check if SNMP exporters are running
  • Investigate NUT UPS service in kube-system namespace
  • Verify zigbee-controller service status
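
To see exactly which targets Prometheus currently reports as down, the Prometheus HTTP API can be queried via a port-forward (a sketch):

kubectl -n observability port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | .scrapeUrl'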

Active Alerts (19 Total)

High Priority:

  1. TargetDown - Related to 4 targets listed above
  2. KubePodNotReady - Related to Ceph PVC provisioning issues (gatus, volsync)
  3. KubeDeploymentRolloutStuck - Likely gatus deployment
  4. KubePersistentVolumeFillingUp - Check which PVs

Medium Priority:

  1. CPUThrottlingHigh - Investigate which pods/namespaces
  2. KubeJobFailed - 2 failed jobs identified:
    • kometa-29344680 (media namespace)
    • plex-off-deck-29344620 (media namespace)
  3. VolSyncVolumeOutOfSync - Expected with current Ceph issues

Informational:

  1. Watchdog - Always firing (heartbeat)
  2. PrometheusDuplicateTimestamps - kube-state-metrics timing issue (low impact)

Recommendations

Immediate Actions (Required before further work):

  1. Loki configuration updated - Ready for commit
  2. ⚠️ Fix Ceph PG issues - Must be done on Proxmox hosts
  3. ⚠️ Verify Ceph health - Run ceph health detail on Proxmox

Post-Ceph Fix:

  1. Delete stuck pods to retry provisioning:

    kubectl delete pod -n observability gatus-6fcfb64bc8-zz996
    kubectl delete pod -n observability volsync-dst-gatus-dst-8wvtx
    
  2. Investigate and fix down Prometheus targets:

    • Check SNMP exporter configurations
    • Verify NUT UPS service
    • Test network connectivity to .manor devices
  3. Review CPU throttling alerts:

    kubectl top pods -A --sort-by=cpu
    # Adjust resource limits as needed
    
  4. Clean up failed CronJobs in media namespace

Long-term Improvements:

  • Add Loki ingestion metrics dashboard
  • Configure log sampling/filtering for high-volume apps
  • Set up PVC capacity monitoring alerts
  • Review and tune Prometheus scrape intervals
  • Consider adding CephFS-specific dashboards

Verification Checklist

After applying fixes:

  • Loki accepting large log entries (check Promtail logs)
  • No "exceeded" errors in Promtail logs
  • Ceph cluster shows HEALTH_OK
  • Gatus pod Running (2/2)
  • All PVCs Bound
  • Prometheus targets down count <= 2 (excluding optional edge devices)
  • Active alerts reduced to baseline (~5-10 expected)
  • All core namespace pods Running
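
A couple of quick spot checks for the storage-related items above (pod names follow earlier sections):

kubectl get pvc -A --no-headers | awk '$3 != "Bound"'    # any PVC not yet Bound
kubectl get pod -n observability | grep -i gatus         # should show Running 2/2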

Infrastructure Context

Deployment Method:

  • GitOps: FluxCD
  • Workflow: Edit repo → User commits → User pushes → Flux reconciles

Storage:

  • Provider: External Ceph cluster (Proxmox)
  • Storage Classes: cephfs-shared (default), cephfs-static
  • Provisioner: rook-ceph.cephfs.csi.ceph.com

Monitoring Namespace:

  • Namespace: observability
  • Components: Grafana, Prometheus (HA), Loki, Promtail, Alertmanager
  • Additional: VPA, Goldilocks, Gatus, Kromgo, various exporters

Next Steps

  1. User Action: Review and commit Loki configuration changes
  2. User Action: Fix Ceph PG availability issues on Proxmox
  3. After Ceph Fix: Proceed with pod cleanup and target investigations
  4. Monitor: Watch for new alerts or recurring issues

Generated: 2025-10-17
Analysis Duration: ~30 minutes
Status: Awaiting user commit and Ceph infrastructure remediation

Ceph RBD Storage Migration Candidates

Analysis performed: 2025-10-17

Overview

This document identifies workloads in the cluster that would benefit from migrating to ceph-rbd (Ceph block storage) instead of cephfs-shared (CephFS shared filesystem).

Key Principle: Databases, time-series stores, and stateful services requiring high I/O performance should use block storage (RBD). Shared files, media libraries, and backups should use filesystem storage (CephFS).


Current Status

Already Using ceph-rbd ✓

  • PostgreSQL (CloudNativePG) - 20Gi data + 5Gi WAL

Storage Classes Available

  • ceph-rbd - Block storage (RWO) - Best for databases
  • cephfs-shared - Shared filesystem (RWX) - Best for shared files/media
  • cephfs-static - Static CephFS volumes

Storage Configuration Patterns

Before migrating workloads, it's important to understand how PVCs are created in this cluster:

Pattern 1: Volsync Component Pattern (Most Apps)

Used by: 41+ applications including all media apps, self-hosted apps, home automation, AI apps

How it works:

  1. Application's ks.yaml includes the volsync component:

    components:
      - ../../../../flux/components/volsync
    
  2. PVC is created by the volsync component template (flux/components/volsync/pvc.yaml)

  3. Storage configuration is set via postBuild.substitute in the ks.yaml:

    postBuild:
      substitute:
        APP: prowlarr
        VOLSYNC_CAPACITY: 5Gi
        VOLSYNC_STORAGECLASS: cephfs-shared      # Default if not specified
        VOLSYNC_ACCESSMODES: ReadWriteMany       # Default if not specified
        VOLSYNC_SNAPSHOTCLASS: cephfs-snapshot   # Default if not specified
    

Default values:

  • Storage Class: cephfs-shared
  • Access Modes: ReadWriteMany
  • Snapshot Class: cephfs-snapshot

Examples:

  • Prowlarr: kubernetes/apps/media/prowlarr/ks.yaml
  • Obsidian CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml
  • Most workloads with < 100Gi storage needs

Pattern 2: Direct HelmRelease Pattern

Used by: Large observability workloads (Prometheus, Loki, AlertManager)

How it works:

  1. Storage is defined directly in the HelmRelease values
  2. No volsync component used
  3. PVC created by Helm chart templates

Example (Prometheus):

# kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: cephfs-shared
          resources:
            requests:
              storage: 100Gi

Examples:

  • Prometheus: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
  • Loki: kubernetes/apps/observability/loki/app/helmrelease.yaml
  • AlertManager: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml

Migration Candidates

🔴 HIGH Priority - Data Durability Risk

1. Dragonfly Redis

  • Namespace: database
  • Current Storage: NONE (ephemeral, in-memory only)
  • Current Size: N/A (data lost on restart)
  • Replicas: 3
  • Recommended: Add ceph-rbd PVCs (~10Gi each for snapshots/persistence)
  • Why: Redis alternative running in cluster mode needs persistent snapshots for:
    • Data durability across restarts
    • Cluster state recovery
    • Snapshot-based backups
  • Impact: HIGH - Currently losing all data on pod restart
  • Config Location: kubernetes/apps/database/dragonfly-redis/cluster/cluster.yaml
  • Migration Complexity: Medium - requires modifying Dragonfly CRD to add volumeClaimTemplates

2. EMQX MQTT Broker

  • Namespace: database
  • Current Storage: NONE (emptyDir, ephemeral)
  • Current Size: N/A (data lost on restart)
  • Replicas: 3 (StatefulSet)
  • Recommended: Add ceph-rbd PVCs (~5-10Gi each for session/message persistence)
  • Why: MQTT brokers need persistent storage for:
    • Retained messages
    • Client subscriptions
    • Session state for QoS > 0
    • Cluster configuration
  • Impact: HIGH - Currently losing retained messages and sessions on restart
  • Config Location: kubernetes/apps/database/emqx/cluster/cluster.yaml
  • Migration Complexity: Medium - requires modifying EMQX CRD to add persistent volumes

🟡 MEDIUM Priority - Performance & Best Practices

3. CouchDB (obsidian-couchdb)

  • Namespace: selfhosted
  • Current Storage: cephfs-shared
  • Current Size: 5Gi
  • Replicas: 1 (Deployment)
  • Storage Pattern: Volsync Component (kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: NoSQL database benefits from:
    • Better I/O performance for document reads/writes
    • Improved fsync performance for data integrity
    • Block-level snapshots for consistent backups
  • Impact: Medium - requires backup, PVC migration, restore
  • Migration Complexity: Medium - GitOps workflow with volsync pattern
    • Update ks.yaml postBuild substitutions
    • Commit and push changes
    • Flux recreates PVC with new storage class
    • Volsync handles data restoration

4. Prometheus

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 2x100Gi (200Gi total across 2 replicas)
  • Replicas: 2 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Time-series database with:
    • Heavy write workload (constant metric ingestion)
    • Random read patterns for queries
    • Significant performance gains with block storage
    • Better compaction performance
  • Impact: HIGH - Largest performance improvement opportunity
  • Migration Complexity: High
    • Large data volume (200Gi total)
    • Update HelmRelease volumeClaimTemplate.spec.storageClassName
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Consider data retention during migration

5. Loki

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 30Gi
  • Replicas: 1 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/loki/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Log aggregation database benefits from:
    • Better write performance for high-volume log ingestion
    • Improved compaction and chunk management
    • Block storage better suited for LSM-tree based storage
  • Impact: Medium - noticeable improvement in log write performance
  • Migration Complexity: Medium
    • Moderate data size
    • Update HelmRelease singleBinary.persistence.storageClass
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Can tolerate some log loss during migration

6. AlertManager

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 2Gi
  • Replicas: 1 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Alert state persistence benefits from:
    • Consistent snapshot capabilities
    • Better fsync performance for state writes
  • Impact: Low - small storage footprint, quick migration
  • Migration Complexity: Low
    • Small data size
    • Update HelmRelease alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Minimal downtime

What Should Stay on CephFS

The following workloads are correctly using CephFS and should NOT be migrated:

Media & Shared Files (RWX Access Required)

  • Media libraries (Plex, Sonarr, Radarr, etc.) - Need shared filesystem access
  • AI models (Ollama 100Gi) - Large files with potential shared access
  • Application configs - Often need shared access across pods

Backup Storage

  • Volsync repositories (cephfs-static) - Restic repositories work well on filesystem
  • MinIO data (cephfs-static, 10Ti) - Object storage on filesystem is appropriate

Other

  • OpenEBS etcd/minio - Already using local PVs (mayastor-etcd-localpv, openebs-minio-localpv)
  • Runner work volumes - Ephemeral workload storage

Migration Summary

Total Storage to Migrate

  • Dragonfly: +30Gi (3 replicas x 10Gi) - NEW storage
  • EMQX: +15-30Gi (3 replicas x 5-10Gi) - NEW storage
  • CouchDB: 5Gi (migrate from cephfs)
  • Prometheus: 200Gi (migrate from cephfs)
  • Loki: 30Gi (migrate from cephfs)
  • AlertManager: 2Gi (migrate from cephfs)

Total New ceph-rbd Needed: ~282-297Gi
Currently Migrating from CephFS: ~237Gi

Recommended Migration Order:

  1. Phase 0: Validation (Test the process)

    • AlertManager - LOW RISK test case to validate GitOps workflow
  2. Phase 1: Data Durability (Immediate)

    • Dragonfly - Add persistent storage
    • EMQX - Add persistent storage
  3. Phase 2: Small Databases (Quick Wins)

    • CouchDB - Medium complexity, important for Obsidian data
  4. Phase 3: Large Time-Series DBs (Performance)

    • Loki - Medium size, good performance gains
    • Prometheus - Large size, significant performance gains

Migration Checklists

Phase 0: AlertManager Migration (Validation Test)

Goal: Validate the GitOps migration workflow with a low-risk workload

Pre-Migration Checklist:

  • Verify current AlertManager state
    kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
    kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
    kubectl describe pvc -n observability alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0 | grep "StorageClass:"
    
  • Check current storage usage
    kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 -- df -h /alertmanager
    
  • Document current alerts (optional - state will rebuild)
    kubectl get prometheusrule -A
    
  • Verify ceph-rbd storage class exists
    kubectl get storageclass ceph-rbd
    kubectl get volumesnapshotclass ceph-rbd-snapshot
    

Migration Steps:

  • Create feature branch
    git checkout -b feat/alertmanager-rbd-migration
    
  • Update HelmRelease configuration
    • File: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    • Change: alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName: ceph-rbd
    • Line: ~104 (search for alertmanager storageClassName)
  • Commit changes
    git add kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    git commit -m "feat(alertmanager): migrate to ceph-rbd storage"
    
  • Push to remote
    git push origin feat/alertmanager-rbd-migration
    
  • Monitor Flux reconciliation
    flux reconcile kustomization kube-prometheus-stack -n observability --with-source
    watch kubectl get pods -n observability -l app.kubernetes.io/name=alertmanager
    
  • Verify new PVC created with ceph-rbd
    kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
    kubectl describe pvc -n observability <new-pvc-name> | grep "StorageClass:"
    
  • Verify AlertManager is running
    kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
    kubectl logs -n observability -l app.kubernetes.io/name=alertmanager --tail=50
    
  • Check AlertManager UI (https://alertmanager.${SECRET_DOMAIN})
    • UI loads successfully
    • Alerts are being received
    • Silences can be created
  • Wait 24 hours to verify stability
  • Merge to main
    git checkout main
    git merge feat/alertmanager-rbd-migration
    git push origin main
    

Post-Migration Validation:

  • Verify old PVC is deleted (should happen automatically)
    kubectl get pvc -A | grep alertmanager
    
  • Check Ceph RBD usage
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph df
    
  • Document lessons learned for larger migrations
  • Update this checklist with any issues encountered

Rollback Plan (if needed):

  • Revert the commit
    git revert HEAD
    git push origin main
    
  • Flux will recreate AlertManager with cephfs-shared
  • Alert state will rebuild (acceptable data loss)

Migration Procedures

Pattern 1: Volsync Component Apps (GitOps Workflow)

Used for: CouchDB, and any app using the volsync component

Steps:

  1. Update ks.yaml - Add storage class overrides to postBuild.substitute:

    postBuild:
      substitute:
        APP: obsidian-couchdb
        VOLSYNC_CAPACITY: 5Gi
        VOLSYNC_STORAGECLASS: ceph-rbd              # Changed from default
        VOLSYNC_ACCESSMODES: ReadWriteOnce          # Changed from ReadWriteMany
        VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot    # Changed from cephfs-snapshot
        VOLSYNC_CACHE_STORAGECLASS: ceph-rbd        # For volsync cache
        VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce    # For volsync cache
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically:

    • Flux detects the change in Git
    • Recreates the PVC with new storage class
    • Volsync ReplicationDestination restores data from backup
    • Application pod starts with new RBD-backed storage
  4. Verify the application is running correctly with new storage:

    kubectl get pvc -n <namespace> <app>
    kubectl describe pvc -n <namespace> <app> | grep StorageClass
    

Example files:

  • CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml

Pattern 2: Direct HelmRelease Apps (GitOps Workflow)

Used for: Prometheus, Loki, AlertManager

Steps:

For Prometheus & AlertManager:

  1. Update helmrelease.yaml - Change storageClassName in volumeClaimTemplate:

    # kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    prometheus:
      prometheusSpec:
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd  # Changed from cephfs-shared
              resources:
                requests:
                  storage: 100Gi
    
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd  # Changed from cephfs-shared
              resources:
                requests:
                  storage: 2Gi
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically:

    • Flux detects the HelmRelease change
    • Helm recreates the StatefulSet
    • New PVCs created with ceph-rbd storage class
    • Pods start with new storage (data loss acceptable for metrics/alerts)

For Loki:

  1. Update helmrelease.yaml - Change storageClass in persistence config:

    # kubernetes/apps/observability/loki/app/helmrelease.yaml
    singleBinary:
      persistence:
        enabled: true
        storageClass: ceph-rbd  # Changed from cephfs-shared
        size: 30Gi
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically - Same process as Prometheus

Note: For observability workloads, some data loss during migration is typically acceptable since:

  • Prometheus has 14d retention - new data will accumulate
  • Loki has 14d retention - new logs will accumulate
  • AlertManager state is ephemeral and will rebuild

For Services Without Storage (Dragonfly, EMQX)

Steps:

  1. Update CRD to add volumeClaimTemplates with ceph-rbd
  2. Commit and push changes
  3. Flux recreates StatefulSet with persistent storage
  4. Configure volsync backup strategy (optional)

Important Migration Considerations

Snapshot Class Compatibility

When migrating from CephFS to Ceph RBD, snapshot classes must match the storage backend:

Storage Class   Compatible Snapshot Class
cephfs-shared   cephfs-snapshot
ceph-rbd        ceph-rbd-snapshot

Why this matters:

  • Volsync uses snapshots for backup/restore operations
  • Using the wrong snapshot class will cause volsync to fail
  • Both the main storage and cache storage need matching snapshot classes

Available VolumeSnapshotClasses in cluster:

$ kubectl get volumesnapshotclass
NAME                DRIVER                          DELETIONPOLICY
ceph-rbd-snapshot   rook-ceph.rbd.csi.ceph.com      Delete
cephfs-snapshot     rook-ceph.cephfs.csi.ceph.com   Delete
csi-nfs-snapclass   nfs.csi.k8s.io                  Delete

Access Mode Changes

Storage Type             Access Mode           Use Case
CephFS (cephfs-shared)   ReadWriteMany (RWX)   Shared filesystems, media libraries
Ceph RBD (ceph-rbd)      ReadWriteOnce (RWO)   Databases, block storage

Impact:

  • RBD volumes can only be mounted by one node at a time
  • Applications must be single-replica or use StatefulSet with pod affinity
  • Most database workloads already use RWO - minimal impact
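
Before flipping a workload to RWO, it can help to list the current storage classes and access modes of all PVCs (a sketch):

kubectl get pvc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,ACCESSMODES:.spec.accessModes'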

Volsync Cache Storage

When using volsync with RBD, both the main storage and cache storage should use RBD:

postBuild:
  substitute:
    # Main PVC settings
    VOLSYNC_STORAGECLASS: ceph-rbd
    VOLSYNC_ACCESSMODES: ReadWriteOnce
    VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot

    # Cache PVC settings (must also match RBD)
    VOLSYNC_CACHE_STORAGECLASS: ceph-rbd
    VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce
    VOLSYNC_CACHE_CAPACITY: 10Gi

Why? Mixing CephFS cache with RBD main storage can cause:

  • Snapshot compatibility issues
  • Performance inconsistencies
  • Backup/restore failures

Technical Notes

  • Ceph RBD Pool: Backed by rook-pvc-pool
  • Storage Class: ceph-rbd
  • Access Mode: RWO (ReadWriteOnce) - single node access
  • Features: Volume expansion enabled, snapshot support
  • Reclaim Policy: Delete
  • CSI Driver: rook-ceph.rbd.csi.ceph.com
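
These properties can be confirmed directly against the storage class object (a sketch):

kubectl get storageclass ceph-rbd -o jsonpath='{.provisioner}{" expansion="}{.allowVolumeExpansion}{" reclaim="}{.reclaimPolicy}{"\n"}'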

References

  • Current cluster storage: kubernetes/apps/storage/
  • Database configs: kubernetes/apps/database/*/cluster/cluster.yaml
  • Storage class definition: Managed by Rook operator

VPA-Based Resource Limit Updates

Summary

This document outlines a plan to systematically update resource limits across the cluster based on VPA (Vertical Pod Autoscaler) recommendations from Goldilocks to eliminate CPU throttling alerts.

Changes Already Made

1. Alert Configuration

File: kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml

  • Changed default receiver from pushover to "null"
  • Added explicit routes for severity: warning and severity: critical to pushover
  • Result: Only critical and warning alerts will trigger pushover notifications (no more info-level spam)
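
The routing section of that AlertmanagerConfig ends up looking roughly like this (a sketch only; receiver names and matcher syntax should be checked against the actual file):

# kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml (sketch)
spec:
  route:
    receiver: "null"            # default: info-level alerts go nowhere
    routes:
      - receiver: pushover
        matchers:
          - name: severity
            value: warning
      - receiver: pushover
        matchers:
          - name: severity
            value: critical
  receivers:
    - name: "null"
    - name: pushover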

2. Promtail Resources

File: kubernetes/apps/observability/promtail/app/helmrelease.yaml

  • CPU Request: 50m → 100m
  • CPU Limit: 100m → 250m
  • Rationale: VPA recommends 101m upper bound, but we added headroom for log bursts

Priority Workloads for Update

High Priority (Currently Throttling or at Risk)

Observability Namespace

  1. Loki - Log aggregation

    • Current: cpu: 35m request, 200m limit
    • VPA: cpu: 23m request, 140m limit
    • Action: Keep current limits (already adequate)
  2. Grafana - Visualization

    • Current: No CPU limits
    • VPA: cpu: 63m request, 213m limit
    • Action: Add limits - 100m request, 500m limit for burst capacity
  3. Internal Nginx Ingress (network namespace)

    • Current: cpu: 500m request, no limit
    • VPA: cpu: 63m request, 316m limit
    • Action: Add 500m limit (keep generous for traffic spikes)

Medium Priority (Good to standardize)

Observability Namespace

  1. kube-state-metrics

    • VPA: cpu: 23m request, 77m limit
    • Action: Add resources block
  2. Goldilocks Controller

    • VPA: cpu: 587m request, 2268m limit (!)
    • Action: Add generous limits for this workload
  3. Blackbox Exporter

    • VPA: cpu: 15m request, 37m limit
    • Action: Add resources block

Network Namespace

  1. External Nginx Ingress

    • VPA: cpu: 49m request, 165m limit
    • Action: Add resources block
  2. Cloudflared

    • VPA: cpu: 15m request, 214m limit
    • Action: Add resources block (note the high burst)

Low Priority (Already well-configured)

  • Node Exporter: Current limits are generous (250m limit vs 22m VPA)
  • DCGM Exporter: Has limits, VPA shows adequate
  • Media workloads: Most have no CPU limits (intentional for high CPU apps like Plex, Bazarr)

Implementation Strategy

Phase 1: Stop the Alerts (DONE ✅)

  • Update alertmanagerconfig to filter by severity
  • Update promtail CPU limits

Phase 2: Observability Namespace (Next)

Update these critical monitoring components:

  • Grafana - Add CPU limits
  • kube-state-metrics - Add resources
  • Goldilocks controller - Add resources
  • Blackbox exporter - Add resources

Phase 3: Network Infrastructure

  • Internal nginx ingress - Add CPU limit
  • External nginx ingress - Add resources
  • Cloudflared - Add resources

Phase 4: Optional Refinements

  • Review VPA recommendations quarterly
  • Adjust limits based on actual usage patterns
  • Consider enabling VPA auto-mode for non-critical workloads

How to Use VPA Recommendations

1. View All Recommendations

# Run the helper script
./scripts/vpa-resource-recommendations.sh

# Or visit the dashboard
open https://goldilocks.chelonianlabs.com

2. Get Specific Workload Recommendations

kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{.status.recommendation.containerRecommendations[0]}' | jq

3. Update HelmRelease

Add a resources block under values:

values:
  resources:
    requests:
      cpu: <vpa_target>
      memory: <vpa_target_memory>
    limits:
      cpu: <vpa_upper_or_2x_for_bursts>
      memory: <vpa_upper_memory>
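
As a concrete instance, Grafana with the VPA figures quoted earlier (63m target, 213m upper bound) plus the recommended headroom might look like this; the memory values here are placeholders, not VPA output:

values:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi   # placeholder - use the VPA memory target
    limits:
      cpu: 500m
      memory: 512Mi   # placeholder - use the VPA memory upper bound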

4. Apply and Monitor

# Commit changes
git add kubernetes/apps/observability/grafana/app/helmrelease.yaml
git commit -m "feat(grafana): add CPU limits based on VPA recommendations"
git push

# Force reconciliation (optional)
flux reconcile helmrelease -n observability grafana

# Monitor for throttling
kubectl top pods -n observability --containers

VPA Interpretation Guide

VPA Recommendation Fields:

  • target: Use as your request value
  • lowerBound: Minimum to function
  • upperBound: Use as limit (or higher for burst workloads)
  • uncappedTarget: What VPA thinks is ideal without constraints
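
To pull these fields for every container in a workload's VPA object (example VPA name from the command above):

kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{": target="}{.target.cpu}{", upper="}{.upperBound.cpu}{"\n"}{end}'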

When to Deviate:

  • Burst workloads (logs, ingress): Use 2-3x upper bound for limits
  • Background jobs: Match VPA recommendations closely
  • User-facing apps: Add 50-100% headroom for traffic spikes
  • Resource-constrained: Start with target, monitor, then adjust

Monitoring for Success

After updates, verify alerts have stopped:

# Check for firing CPU throttling alerts via the Alertmanager API (port-forward first)
kubectl -n observability port-forward svc/alertmanager-operated 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname' | grep -i throttl

# Check actual CPU usage vs limits
kubectl top pods -A --containers | sort -k4 -h -r | head -20

# Review VPA over time
watch kubectl get vpa -n observability

Tools Created

  1. scripts/vpa-resource-recommendations.sh - Extract VPA recommendations with HelmRelease file locations
  2. This document - Implementation plan and guidance

References