Dapper Cluster Documentation

This documentation covers the architecture, configuration, and operations of the Dapper Kubernetes cluster, a high-performance home lab infrastructure with GPU capabilities.

Cluster Overview

graph TD
    subgraph Control Plane
        CP1[Control Plane 1<br>4 CPU, 16GB]
        CP2[Control Plane 2<br>4 CPU, 16GB]
        CP3[Control Plane 3<br>4 CPU, 16GB]
    end

    subgraph Worker Nodes
        W1[Worker 1<br>16 CPU, 128GB]
        W2[Worker 2<br>16 CPU, 128GB]
        GPU[GPU Node<br>16 CPU, 128GB<br>4x Tesla P100]
    end

    CP1 --- CP2
    CP2 --- CP3
    CP3 --- CP1

    Control Plane --> Worker Nodes

Hardware Specifications

Control Plane

  • 3 nodes for high availability
  • 4 CPU cores per node
  • 16GB RAM per node
  • Dedicated to cluster control plane operations

Worker Nodes

  • 2 general-purpose worker nodes
  • 16 CPU cores per node
  • 128GB RAM per node
  • Handles general workloads and applications

GPU Node

  • Specialized GPU worker node
  • 16 CPU cores
  • 128GB RAM
  • 4x NVIDIA Tesla P100 GPUs
  • Handles ML/AI and GPU-accelerated workloads

Key Features

  • High-availability Kubernetes cluster
  • GPU acceleration support
  • Automated deployment using Flux CD
  • Secure secrets management with SOPS
  • NFS and OpenEBS storage integration
  • Comprehensive monitoring and observability
  • Media services automation

Infrastructure Components

graph TD
    subgraph Core Services
        Flux[Flux CD]
        Storage[Storage Layer]
        Network[Network Layer]
    end

    subgraph Applications
        Media[Media Stack]
        Monitor[Monitoring]
        GPU[GPU Workloads]
    end

    Core Services --> Applications

    Storage --> |NFS/OpenEBS| Applications
    Network --> |Ingress/DNS| Applications

Documentation Structure

  • Architecture: Detailed technical documentation about cluster design and components

    • High-availability control plane design
    • Storage architecture and configuration
    • Network topology and policies
    • GPU integration and management
  • Applications: Information about deployed applications and their configurations

    • Media services stack
    • Monitoring and observability
    • GPU-accelerated applications
  • Operations: Guides for installation, maintenance, and troubleshooting

    • Cluster setup procedures
    • Node management
    • GPU configuration
    • Maintenance tasks

Getting Started

For new users, we recommend starting with:

  1. Architecture Overview - Understanding the cluster design
  2. Installation Guide - Setting up the cluster
  3. Application Stack - Deploying applications

Architecture Overview

Cluster Architecture

graph TD
    subgraph Control Plane
        CP1[Control Plane 1<br>4 CPU, 16GB]
        CP2[Control Plane 2<br>4 CPU, 16GB]
        CP3[Control Plane 3<br>4 CPU, 16GB]

        CP1 --- CP2
        CP2 --- CP3
        CP3 --- CP1
    end

    subgraph Worker Nodes
        W1[Worker 1<br>16 CPU, 128GB]
        W2[Worker 2<br>16 CPU, 128GB]
    end

    subgraph GPU Node
        GPU[GPU Worker<br>16 CPU, 128GB<br>4x Tesla P100]
    end

    Control Plane --> Worker Nodes
    Control Plane --> GPU

Core Components

Control Plane

  • High Availability: 3-node control plane configuration
  • Resource Allocation: 4 CPU, 16GB RAM per node
  • Components:
    • etcd cluster
    • API Server
    • Controller Manager
    • Scheduler

Worker Nodes

  • General Purpose Workers: 2 nodes
  • Resources per Node:
    • 16 CPU cores
    • 128GB RAM
  • Workload Types:
    • Application deployments
    • Database workloads
    • Media services
    • Monitoring systems

GPU Node

  • Specialized Worker: 1 node
  • Hardware:
    • 16 CPU cores
    • 128GB RAM
    • 4x NVIDIA Tesla P100 GPUs
  • Workload Types:
    • ML/AI workloads
    • Video transcoding
    • GPU-accelerated applications
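
As a quick illustration (not taken from the cluster manifests), a workload claims one of the Tesla P100s through the extended resource advertised by the NVIDIA device plugin / GPU operator, typically nvidia.com/gpu. The pod name and image tag below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder CUDA image
      command: ["nvidia-smi"]                     # prints the allocated P100
      resources:
        limits:
          nvidia.com/gpu: 1                       # schedules the pod onto the GPU node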

Network Architecture

graph TD
    subgraph External
        Internet((Internet))
        DNS((DNS))
    end

    subgraph Network Edge
        FW[Firewall]
        LB[Load Balancer]
    end

    subgraph Kubernetes Network
        CP[Control Plane]
        Workers[Worker Nodes]
        GPUNode[GPU Node]

        subgraph Services
            Ingress[Ingress Controller]
            CoreDNS[CoreDNS]
            CNI[Network Plugin]
        end
    end

    Internet --> FW
    DNS --> FW
    FW --> LB
    LB --> CP
    CP --> Workers
    CP --> GPUNode
    Services --> Workers
    Services --> GPUNode

Storage Architecture

graph TD
    subgraph Storage Classes
        NFS[NFS Storage Class]
        OpenEBS[OpenEBS Storage Class]
    end

    subgraph Persistent Volumes
        NFS --> NFS_PV[NFS PVs]
        OpenEBS --> Local_PV[Local PVs]
    end

    subgraph Workload Types
        NFS_PV --> Media[Media Storage]
        NFS_PV --> Shared[Shared Config]
        Local_PV --> DB[Databases]
        Local_PV --> Cache[Cache Storage]
    end

Security Considerations

  • Network segmentation using Kubernetes network policies
  • Encrypted secrets management with SOPS
  • TLS encryption for all external services
  • Regular security updates via automated pipelines
  • GPU access controls and resource quotas

Scalability

The cluster architecture is designed to be scalable:

  • High-availability control plane (3 nodes)
  • Expandable worker node pool
  • Specialized GPU node for compute-intensive tasks
  • Dynamic storage provisioning
  • Load balancing for external services
  • Resource quotas and limits management

Monitoring and Observability

graph LR
    subgraph Monitoring Stack
        Prom[Prometheus]
        Graf[Grafana]
        Alert[Alertmanager]
    end

    subgraph Node Types
        CP[Control Plane Metrics]
        Work[Worker Metrics]
        GPU[GPU Metrics]
    end

    CP --> Prom
    Work --> Prom
    GPU --> Prom
    Prom --> Graf
    Prom --> Alert

Resource Management

Control Plane

  • Reserved for Kubernetes control plane components
  • Optimized for control plane operations
  • High availability configuration

Worker Nodes

  • General purpose workloads
  • Balanced resource allocation
  • Flexible scheduling options

GPU Node

  • Dedicated for GPU workloads
  • NVIDIA GPU operator integration
  • Specialized resource scheduling

Network Architecture

This document covers the Kubernetes application-level networking. For physical network topology and VLAN configuration, see Network Topology.

Container Networking (CNI)

Cilium CNI

The cluster uses Cilium as the primary Container Network Interface (CNI):

  • Pod CIDR: 10.69.0.0/16 (native routing mode)
  • Service CIDR: 10.96.0.0/16
  • Mode: Non-exclusive (paired with Multus for multi-network support)
  • Kube-Proxy Replacement: Enabled (eBPF-based service load balancing)
  • Load Balancing Algorithm: Maglev with DSR (Direct Server Return)
  • Network Policy: Endpoint routes enabled
  • BPF Masquerading: Enabled for outbound traffic

Key Features:

  • High-performance eBPF data plane
  • Native Kubernetes network policy support
  • L2 announcements for external load balancer IPs
  • Advanced observability and monitoring
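
As a hedged sketch of how these settings map to Cilium Helm values (the authoritative values live in the Cilium HelmRelease; key names follow recent Cilium charts and may vary slightly by version):

# Illustrative Cilium values only; not the cluster's actual HelmRelease
routingMode: native
ipv4NativeRoutingCIDR: 10.69.0.0/16   # pod CIDR, routed natively
kubeProxyReplacement: true            # eBPF service load balancing
bpf:
  masquerade: true
loadBalancer:
  algorithm: maglev
  mode: dsr
endpointRoutes:
  enabled: true
l2announcements:
  enabled: true
cni:
  exclusive: false                    # leave room for Multus as a secondary CNI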

Multus CNI (Multiple Networks)

Multus provides additional network interfaces to pods beyond the primary Cilium network:

  • Primary Use: IoT network attachment (VLAN-based isolation)
  • Network Attachment: macvlan on ens19 interface
  • Mode: Bridge mode with DHCP IPAM
  • Purpose: Enable pods to connect to additional networks (e.g., IoT devices, legacy systems)

Pods can request additional networks via annotations:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf
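
For reference, the attachment named in that annotation is defined by a NetworkAttachmentDefinition. A sketch matching the description above (macvlan on ens19, bridge mode, DHCP IPAM); the exact CNI JSON in the cluster may differ:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens19",
      "mode": "bridge",
      "ipam": { "type": "dhcp" }
    }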

Ingress Controllers

The cluster uses dual ingress-nginx controllers for traffic routing:

Internal Ingress

  • Class: internal (default)
  • Purpose: Internal services, private DNS
  • Version: v4.13.3
  • Load Balancer: Cilium L2 announcement
  • DNS: Synced to internal DNS via k8s-gateway and External-DNS (UniFi webhook)

External Ingress

  • Class: external
  • Purpose: Public-facing services
  • Version: v4.13.3
  • Load Balancer: Cilium L2 announcement
  • DNS: Synced to Cloudflare via External-DNS
  • Tunnel: Cloudflared for secure access

Load Balancer IP Management

Cilium L2 Announcements

Cilium's L2 announcement feature provides load balancer IPs for services:

  • How it works: Cilium announces load balancer IPs via L2 (ARP/NDP)
  • Policy-based: L2AnnouncementPolicy defines which services get announced
  • Benefits:
    • No external load balancer required
    • Native Kubernetes LoadBalancer service type support
    • High availability through leader election
    • Automatic failover

Configuration: See kubernetes/apps/kube-system/cilium/config/l2.yaml

This enables both ingress controllers to receive external IPs that are accessible from the broader network.
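
For orientation, an L2AnnouncementPolicy has roughly the following shape (the real policy is in the l2.yaml referenced above; the selector and interface pattern here are placeholders):

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-announcements
spec:
  serviceSelector:
    matchLabels: {}        # placeholder: announce all LoadBalancer services
  interfaces:
    - ^eth[0-9]+$          # placeholder interface regex
  externalIPs: true
  loadBalancerIPs: true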

Network Policies

graph LR
    subgraph Policies
        Default[Default Deny]
        Allow[Allowed Routes]
    end

    subgraph Apps
        Media[Media Stack]
        Monitor[Monitoring]
        DB[Databases]
    end

    Allow --> Media
    Allow --> Monitor
    Default --> DB

DNS Configuration

Internal DNS (k8s-gateway)

  • Purpose: DNS server for internal ingresses
  • Domain: Internal cluster services
  • Integration: Works with External-DNS for automatic record creation

External-DNS (Dual Instances)

Instance 1: Internal DNS

  • Provider: UniFi (via webhook provider)
  • Target: UDM Pro Max
  • Ingress Class: internal
  • Purpose: Sync private DNS records for internal services

Instance 2: External DNS

  • Provider: Cloudflare
  • Ingress Class: external
  • Purpose: Sync public DNS records for externally accessible services

How DNS Works

  1. Create an Ingress with class internal or external
  2. External-DNS watches for new/updated ingresses
  3. Appropriate External-DNS instance syncs DNS records to target provider
  4. Services become accessible via their configured hostnames
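
Concretely, exposing a service is just a matter of creating an Ingress with the right class; a minimal, illustrative example for a public service (hostname, names, and annotation value are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-namespace
  annotations:
    # optional CNAME target, e.g. a tunnel hostname (placeholder value)
    external-dns.alpha.kubernetes.io/target: external.example.com
spec:
  ingressClassName: external   # use "internal" for private DNS via the UniFi webhook
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80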

Security

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
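
The default-deny baseline is then opened up with narrow allow rules. A hedged example that would let Prometheus in the monitoring namespace scrape an application's metrics port (namespace, labels, and port are placeholders):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-monitoring-scrape
  namespace: my-namespace              # placeholder
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: my-app   # placeholder label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080                   # placeholder metrics port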

TLS Configuration

  • Automatic certificate management via cert-manager
  • Let's Encrypt integration
  • Internal PKI for service mesh
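
A representative cert-manager ClusterIssuer for the Let's Encrypt integration, shown with a Cloudflare DNS-01 solver since external DNS already lives there (issuer name, email, and secret name are placeholders, not the cluster's actual config):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com               # placeholder
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # placeholder secret
              key: api-token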

Service Mesh

Traffic Flow

graph LR
    subgraph Ingress
        External[External Traffic]
        Nginx[Ingress-NGINX]
    end

    subgraph Services
        App1[Service 1]
        App2[Service 2]
        DB[Database]
    end

    External --> Nginx
    Nginx --> App1
    Nginx --> App2
    App1 --> DB
    App2 --> DB

Best Practices

  1. Security

    • Implement default deny policies
    • Use TLS everywhere
    • Regular security audits
    • Network segmentation
  2. Performance

    • Load balancer optimization
    • Connection pooling
    • Proper resource allocation
    • Traffic monitoring
  3. Reliability

    • High availability configuration
    • Failover planning
    • Backup routes
    • Health checks
  4. Monitoring

    • Network metrics collection
    • Traffic analysis
    • Latency monitoring
    • Bandwidth usage tracking

Troubleshooting

Common network issues and resolution steps:

  1. Connectivity Issues

    • Check network policies
    • Verify DNS resolution
    • Inspect service endpoints
    • Review ingress configuration
  2. Performance Problems

    • Monitor network metrics
    • Check for bottlenecks
    • Analyze traffic patterns
    • Review resource allocation

Network Topology

Overview

The Dapper Cluster network spans two physical locations (garage/shop and house) connected via a 1Gbps wireless bridge. The network uses a dual-switch design in the garage with high-speed interconnects for server and storage traffic.

Network Locations

Garage/Shop (Server Room)

  • Primary compute infrastructure
  • Core and distribution switches
  • 4x Proxmox hosts running Talos/Kubernetes
  • High-speed storage network

House

  • Access layer switch
  • OPNsense router/firewall
  • Client devices
  • Connected to garage via 60GHz wireless bridge (1Gbps)

Device Inventory

Core Network Equipment

| Device | Model | Management IP | Location | Role | Notes |
|---|---|---|---|---|---|
| Brocade Core | ICX6610 | 192.168.1.20 | Garage | Core/L3 Switch | Manages VLAN 150, 200 routing |
| Arista Distribution | 7050 | 192.168.1.21 | Garage | Distribution Switch | High-speed 40Gb interconnects |
| Aruba Access | S2500-48p | 192.168.1.26 | House | Access Switch | PoE, client devices |
| OPNsense Router | i3-4130T - 16GB | 192.168.1.1 | House | Router/Firewall | Manages VLAN 1, 100 routing |
| Mikrotik Radio (House) | NRay60 | 192.168.1.7 | House | Wireless Bridge | 1Gbps to garage |
| Mikrotik Radio (Shop) | NRay60 | 192.168.1.8 | Garage | Wireless Bridge | 1Gbps to house |
| Mikrotik Switch | CSS326-24G-2S | 192.168.1.27 | Garage | Wireless Bridge - Brocade Core | Always-up interconnect |

Compute Infrastructure

| Device | Management IP | IPMI IP | Location | Links | Notes |
|---|---|---|---|---|---|
| Proxmox Host 1 | 192.168.1.62 | 192.168.1.162 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 2 | 192.168.1.63 | 192.168.1.165 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 3 | 192.168.1.64 | 192.168.1.163 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |
| Proxmox Host 4 | 192.168.1.66 | 192.168.1.164 | Garage | 6 total | 3x 1Gb, 2x 10Gb, 1x 40Gb |

Kubernetes Nodes (VMs on Proxmox)

| Hostname | Primary IP (VLAN 100) | Storage IP (VLAN 150) | Role | Host |
|---|---|---|---|---|
| talos-control-1 | 10.100.0.50 | 10.150.0.10 | Control Plane | Proxmox-03 |
| talos-control-2 | 10.100.0.51 | 10.150.0.11 | Control Plane | Proxmox-04 |
| talos-control-3 | 10.100.0.52 | 10.150.0.12 | Control Plane | Proxmox-02 |
| talos-node-gpu-1 | 10.100.0.53 | 10.150.0.13 | Worker (GPU) | Proxmox-03 |
| talos-node-large-1 | 10.100.0.54 | 10.150.0.14 | Worker | Proxmox-03 |
| talos-node-large-2 | 10.100.0.55 | 10.150.0.15 | Worker | Proxmox-03 |
| talos-node-large-3 | 10.100.0.56 | 10.150.0.16 | Worker | Proxmox-03 |

Kubernetes Cluster VIP: 10.100.0.40 (shared across control plane nodes)
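
The VIP is provided by Talos itself on the control plane nodes; in machine-config terms it is declared roughly as below (see talconfig.yaml for the authoritative version; the interface name is a placeholder):

machine:
  network:
    interfaces:
      - interface: eth0          # placeholder NIC name
        dhcp: false
        addresses:
          - 10.100.0.50/24       # the node's own VLAN 100 address
        vip:
          ip: 10.100.0.40        # shared Kubernetes API VIP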


VLAN Configuration

| VLAN ID | Network | Subnet | Gateway | MTU | Purpose | Gateway Device | Notes |
|---|---|---|---|---|---|---|---|
| 1 | LAN | 192.168.1.0/24 | 192.168.1.1 | 1500 | Management, clients | OPNsense | Default VLAN |
| 100 | SERVERS | 10.100.0.0/24 | 10.100.0.1 | 1500 | Kubernetes nodes, VMs | OPNsense | Primary server network |
| 150 | CEPH-PUBLIC | 10.150.0.0/24 | None (internal) | 9000 | Ceph client/monitor | Brocade | Jumbo frames enabled, no gateway needed |
| 200 | CEPH-CLUSTER | 10.200.0.0/24 | None (internal) | 9000 | Ceph OSD replication | Arista | Jumbo frames enabled, no gateway needed |

Kubernetes Internal Networks

| Network | CIDR | Purpose | MTU |
|---|---|---|---|
| Pod Network | 10.69.0.0/16 | Cilium pod CIDR | 1500 |
| Service Network | 10.96.0.0/16 | Kubernetes services | 1500 |

VLAN Tagging Summary

  • Tagged (Trunked): All inter-switch links, Proxmox host uplinks (for VM traffic)
  • Untagged Access Ports: Client devices on appropriate VLANs
  • [TODO: Document which VLANs are allowed on which trunk ports]

Physical Topology

High-Level Site Connectivity

graph TB
    subgraph House
        ARUBA[Aruba S2500-48p<br/>192.168.1.26]
        OPNS[OPNsense Router<br/>Gateway for VLAN 1, 100]
        MTIKHOUSE[Mikrotik NRay60<br/>192.168.1.7]
        CLIENTS[Client Devices]

        OPNS --- ARUBA
        ARUBA --- CLIENTS
        ARUBA --- MTIKHOUSE
    end

    subgraph Wireless Bridge [60GHz Wireless Bridge - 1Gbps]
        MTIKHOUSE <-.1Gbps Wireless.-> MTIKSHOP
    end

    subgraph Garage/Shop
        MTIKSHOP[Mikrotik NRay60<br/>192.168.1.8]
        BROCADE[Brocade ICX6610<br/>192.168.1.20<br/>Core/L3 Switch]
        ARISTA[Arista 7050<br/>192.168.1.21<br/>Distribution Switch]

        MTIKSHOP --- BROCADE

        PX1[Proxmox Host 1<br/>6 links]
        PX2[Proxmox Host 2<br/>6 links]
        PX3[Proxmox Host 3<br/>6 links]
        PX4[Proxmox Host 4<br/>6 links]

        BROCADE <-->|2x 40Gb QSFP+<br/>ONE DISABLED| ARISTA

        PX1 --> BROCADE
        PX2 --> BROCADE
        PX3 --> BROCADE
        PX4 --> BROCADE

        PX1 -.40Gb.-> ARISTA
        PX2 -.40Gb.-> ARISTA
        PX3 -.40Gb.-> ARISTA
        PX4 -.40Gb.-> ARISTA
    end

    style BROCADE fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style ARISTA fill:#389826,stroke:#fff,stroke-width:2px,color:#fff
    style ARUBA fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff

Proxmox Host Connectivity Detail

Each Proxmox host has 6 network connections:

graph LR
    subgraph Proxmox Host [Single Proxmox Host - 6 Links]
        IPMI[IPMI/BMC<br/>1Gb NIC]
        MGMT[Management Bond<br/>2x 1Gb]
        VM[VM Bridge Bond<br/>2x 10Gb]
        CEPH[Ceph Storage<br/>1x 40Gb]
    end

    subgraph Brocade ICX6610
        B1[Port: 1Gb]
        B2[LAG: 2x 1Gb]
        B3[LAG: 2x 10Gb]
    end

    subgraph Arista 7050
        A1[Port: 40Gb]
    end

    IPMI -->|Standalone| B1
    MGMT -->|LACP Bond| B2
    VM -->|LACP Bond<br/>VLAN 100, 150| B3
    CEPH -->|Standalone<br/>VLAN 200| A1

    style Brocade fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style Arista fill:#389826,stroke:#fff,stroke-width:2px,color:#fff

Per-Host Link Summary:

  • IPMI: 1x 1Gb to Brocade (dedicated management)
  • Proxmox Management: 2x 1Gb LACP bond to Brocade (Proxmox host IP)
  • VM Traffic: 2x 10Gb LACP bond to Brocade (bridges for VMs, VLAN 100, 150)
  • Ceph Cluster: 1x 40Gb to Arista (VLAN 200 only)

Total Bandwidth per Host:

  • To Brocade: 23 Gbps (3 + 20 Gbps)
  • To Arista: 40 Gbps

Brocade-Arista Interconnect (ISSUE)

graph LR
    subgraph Brocade ICX6610
        BP1[QSFP+ Port 1<br/>40Gb]
        BP2[QSFP+ Port 2<br/>40Gb]
    end

    subgraph Arista 7050
        AP1[QSFP+ Port 1<br/>40Gb<br/>ACTIVE]
        AP2[QSFP+ Port 2<br/>40Gb<br/>DISABLED]
    end

    BP1 ---|Currently: Simple Trunk| AP1
    BP2 -.-|DISABLED to prevent loop| AP2

    style AP2 fill:#ff0000,stroke:#fff,stroke-width:2px,color:#fff

CURRENT ISSUE:

  • 2x 40Gb links are configured as separate trunk ports (default VLAN 1, passing all VLANs)
  • This creates a layer 2 loop
  • ONE port disabled on Arista side as workaround
  • SOLUTION NEEDED: Configure proper LACP/port-channel on both switches

[TODO: Document target LAG configuration]


Logical Topology

Layer 2 VLAN Distribution

graph TB
    subgraph Layer 2 VLANs
        V1[VLAN 1: Management<br/>192.168.1.0/24]
        V100[VLAN 100: Servers<br/>10.100.0.0/24]
        V150[VLAN 150: Ceph Public<br/>10.150.0.0/24]
        V200[VLAN 200: Ceph Cluster<br/>10.200.0.0/24]
    end

    subgraph Brocade Core
        B[Brocade ICX6610<br/>L3 Gateway<br/>VLAN 150, 200]
    end

    subgraph Arista Distribution
        A[Arista 7050<br/>L2 Only]
    end

    subgraph OPNsense Router
        O[OPNsense<br/>L3 Gateway<br/>VLAN 1, 100]
    end

    V1 --> B
    V100 --> B
    V150 --> B
    V200 --> A

    B -->|Routes to| O
    A --> B

    style B fill:#2f73d8,stroke:#fff,stroke-width:2px,color:#fff
    style O fill:#d83933,stroke:#fff,stroke-width:2px,color:#fff

Layer 3 Routing

Primary Gateways:

  • OPNsense (at house):

    • VLAN 1: 192.168.1.1
    • VLAN 100: 10.100.0.1
    • Default gateway for internet access
    • 2.5 Gbps AT&T fiber
  • Brocade ICX6610 (at garage):

    • VLAN 1: 192.168.1.20
    • VLAN 100: 10.100.0.10
    • VLAN 150: None
    • VLAN 200: None
    • VIPs that route back to the gateway at 192.168.1.1 or 10.100.0.1

[TODO: Document inter-VLAN routing rules]

  • Can VLAN 150/200 reach the internet? USER-TODO: Need to check; not confirmed
  • Are there firewall rules blocking inter-VLAN traffic?
  • How does Ceph traffic route if needed?

Traffic Flows

VLAN 1 (Management) - 192.168.1.0/24

Purpose: Switch management, IPMI, admin access

Flow:

Client (House)
  → Aruba Switch
  → Wireless Bridge (1Gbps)
  → Brocade
  → Switch/IPMI management interface

Devices:

  • All switch management IPs
  • Proxmox IPMI interfaces
  • Admin workstations
  • [TODO: Complete device list]

VLAN 100 (Servers) - 10.100.0.0/24

Purpose: Kubernetes nodes, VM primary network

Flow:

Talos Node (10.100.0.50-56)
  → Proxmox VM Bridge (10Gb bond)
  → Brocade (2x 10Gb bond)
  → Routes through OPNsense for internet

Key Services:

  • Kubernetes API: 10.100.0.40:6443 (VIP)
  • Talos nodes: 10.100.0.50-56

Internet Access: Yes (via OPNsense gateway)


VLAN 150 (Ceph Public) - 10.150.0.0/24

Purpose: Ceph client connections, monitor communication, CSI drivers

MTU: 9000 (Jumbo frames)

Flow:

Kubernetes Pod (needs storage)
  → Rook CSI Driver
  → Talos Node (10.150.0.10-16)
  → Proxmox VM Bridge (10Gb bond)
  → Brocade (2x 10Gb bond)
  → Ceph Monitors on Proxmox hosts

Key Services:

  • Ceph Monitors: 10.150.0.4, 10.150.0.2
  • Kubernetes nodes: 10.150.0.10-16 (secondary IPs)
  • Rook CSI drivers connect via this network

Gateway: None required (internal-only network)

Internet Access: Not needed (Ceph storage network)

Performance:

  • 2x 10Gb bonded links per host
  • Jumbo frames (MTU 9000)
  • Shared with VLAN 100 on same physical bond
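
On the Talos side this only requires giving the storage interface its VLAN 150 address and MTU 9000; a hedged machine-config fragment (interface name is a placeholder, the real settings live in the machine-network patch):

machine:
  network:
    interfaces:
      - interface: eth1          # placeholder: NIC carrying VLAN 150
        dhcp: false
        mtu: 9000                # jumbo frames, must match the switch ports
        addresses:
          - 10.150.0.10/24       # storage IP from the node table above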

VLAN 200 (Ceph Cluster) - 10.200.0.0/24

Purpose: Ceph OSD replication, cluster heartbeat (backend traffic)

MTU: 9000 (Jumbo frames)

Flow:

Ceph OSD on Proxmox Host 1
  → 40Gb link to Arista
  → Arista 7050 (switch fabric)
  → 40Gb link to Proxmox Host 2-4
  → Ceph OSD on other hosts

Key Characteristics:

  • Dedicated high-speed path: Uses 40Gb links exclusively
  • East-west traffic only: OSD-to-OSD replication
  • Does NOT traverse Brocade for data path
  • Arista provides switching for this VLAN

Gateway: None required (internal-only network)

Internet Access: Not needed (Ceph backend replication only)

Performance:

  • 40Gbps per host to Arista
  • Dedicated bandwidth (not shared with other traffic)
  • Jumbo frames critical for large object transfers

[TODO: Document Proxmox host IPs on this VLAN] USER-TODO: Need to choose/configure


Traffic Segregation Summary

| VLAN | Physical Path | Bandwidth | MTU | Shared? |
|---|---|---|---|---|
| 1 (Management) | 1Gb/10Gb to Brocade | Shared | 1500 | Yes |
| 100 (Servers) | 2x 10Gb bond to Brocade | 20 Gbps | 1500 | Yes (with VLAN 150) |
| 150 (Ceph Public) | 2x 10Gb bond to Brocade | 20 Gbps | 9000 | Yes (with VLAN 100) |
| 200 (Ceph Cluster) | 1x 40Gb to Arista | 40 Gbps | 9000 | No (dedicated) |

Switch Configuration

Brocade ICX6610 Configuration

Role: Core L3 switch, VLAN routing for 150/200

Port Assignments:

[TODO: Document port assignments]

Example:
- Ports 1/1/1-4: IPMI connections (VLAN 1 untagged)
- Ports 1/1/5-12: Proxmox management bonds (LAG groups)
- Ports 1/1/13-20: Proxmox 10Gb bonds (LAG groups, trunk VLAN 100, 150)
- Ports 1/1/41-42: 40Gb to Arista (LAG group, trunk all VLANs)
- Port 1/1/48: Uplink to Mikrotik (trunk all VLANs)

VLAN Interfaces (SVI):

[TODO: Brocade config snippet]

interface ve 1
  ip address 192.168.1.20/24

interface ve 150
  ip address [TODO]/24
  mtu 9000

interface ve 200
  ip address [TODO]/24
  mtu 9000

Static Routes:

[TODO: Document static routes to OPNsense]

Arista 7050 Configuration

Role: High-speed distribution for VLAN 200 (Ceph cluster)

Port Assignments:

[TODO: Document port assignments]

Example:
- Ports Et1-4: Proxmox 40Gb links (VLAN 200 tagged)
- Ports Et49-50: 40Gb to Brocade (port-channel, trunk all VLANs)

Configuration:

[TODO: Arista config snippet for port-channel]

Aruba S2500-48p Configuration

Role: Access switch at house

Uplink: Via Mikrotik wireless bridge to garage

[TODO: Document VLAN configuration and port assignments]

Common Configuration Tasks

Fix Brocade-Arista LAG Issue

Current State: One 40Gb link disabled to prevent loop

Target State: Both 40Gb links in LACP port-channel

Brocade Configuration:

[TODO: Brocade LACP config]

lag "brocade-to-arista" dynamic id [lag-id]
  ports ethernet 1/1/41 to 1/1/42
  primary-port 1/1/41
  deploy

interface ethernet 1/1/41
  link-aggregate active

interface ethernet 1/1/42
  link-aggregate active

Arista Configuration:

[TODO: Arista LACP config]

interface Port-Channel1
  description Link to Brocade ICX6610
  switchport mode trunk
  switchport trunk allowed vlan 1,100,150,200

interface Ethernet49
  channel-group 1 mode active
  description Link to Brocade 40G-1

interface Ethernet50
  channel-group 1 mode active
  description Link to Brocade 40G-2

Performance Characteristics

Bandwidth Allocation

Total Uplink Capacity (Garage to House):

  • 1 Gbps (Mikrotik 60GHz bridge)
  • Bottleneck: All VLAN 1 and internet-bound traffic limited to 1Gbps

Garage Internal Bandwidth:

  • Brocade to Hosts: 92 Gbps aggregate (12x 1Gb + 8x 10Gb bonds)
  • Arista to Hosts: 160 Gbps (4x 40Gb)
  • Brocade-Arista: 40 Gbps (when LAG working: 80 Gbps)

Expected Traffic Patterns

High Bandwidth Flows:

  1. Ceph OSD replication (VLAN 200) - 40Gb per host
  2. Ceph client I/O (VLAN 150) - 20Gb shared per host
  3. VM network traffic (VLAN 100) - 20Gb shared per host

Constrained Flows:

  1. Internet access - limited to 1Gbps wireless bridge
  2. Management traffic - shared 1Gbps wireless bridge

Troubleshooting Reference

Connectivity Testing

Test Management Access:

# From any client
ping 192.168.1.20  # Brocade
ping 192.168.1.21  # Arista
ping 192.168.1.26  # Aruba

# Test across wireless bridge
ping 192.168.1.7   # Mikrotik House
ping 192.168.1.8   # Mikrotik Shop

Test VLAN 100 (Servers):

ping 10.100.0.40   # Kubernetes VIP
ping 10.100.0.50   # Talos control-1

Test VLAN 150 (Ceph Public):

ping 10.150.0.10   # Talos control-1 storage interface

Check LAG Status

Brocade:

show lag
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Arista:

show port-channel summary
show interface ethernet 49
show interface ethernet 50

Monitor Traffic

Brocade:

show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate

Check VLAN configuration:

show vlan
show interface brief

Known Issues and Gotchas

Active Issues

  1. Brocade-Arista Interconnect Loop

    • Symptom: Network storms, high CPU on switches, connectivity issues
    • Current Workaround: One 40Gb link disabled on Arista side
    • Root Cause: Links configured as separate trunks instead of LAG
    • Solution: Configure LACP/port-channel on both switches (see above)
  2. [TODO: Document other known issues]

Design Considerations

  1. Wireless Bridge Bottleneck

    • All internet traffic and house-to-garage limited to 1Gbps
    • Management access during wireless outage is difficult
    • Consider: OOB management network or local crash cart access
  2. Single Point of Failure

    • Wireless bridge failure isolates garage from house
    • Brocade failure loses routing for VLAN 150/200
    • Consider: Redundancy strategy
  3. VLAN 200 Routing

    • If gateway is on Brocade but traffic flows through Arista, need to verify routing
    • Confirm: Does VLAN 200 need a gateway at all? (internal only)

Future Improvements

[TODO: Document planned network changes]

  • Fix Brocade-Arista LAG to enable second 40Gb link
  • Document complete port assignments for all switches
  • Add network monitoring/observability (Prometheus exporters?)
  • Consider redundant wireless link or fiber between buildings
  • Implement proper change management for switch configs
  • [TODO: Add your planned improvements]

Change Log

| Date | Change | Person | Notes |
|---|---|---|---|
| 2025-10-14 | Initial documentation created | Claude | Baseline network topology documentation |
| [TODO] | [TODO] | [TODO] | [TODO] |

References

  • Talos Configuration: kubernetes/bootstrap/talos/talconfig.yaml
  • Network Patches: kubernetes/bootstrap/talos/patches/global/machine-network.yaml
  • Kubernetes Network: See docs/src/architecture/network.md for application-level networking
  • Storage Network: See docs/src/architecture/storage.md for Ceph network details

Storage Architecture

Storage Overview

The Dapper Cluster uses Rook Ceph as its primary storage solution, providing unified storage for all Kubernetes workloads. The external Ceph cluster runs on Proxmox hosts and is connected to the Kubernetes cluster via Rook's external cluster mode.

graph TD
    subgraph External Ceph Cluster
        MON[Ceph Monitors]
        OSD[Ceph OSDs]
        MDS[Ceph MDS - CephFS]
    end

    subgraph Kubernetes Cluster
        ROOK[Rook Operator]
        CSI[Ceph CSI Drivers]
        SC[Storage Classes]
    end

    subgraph Applications
        APPS[Application Pods]
        PVC[Persistent Volume Claims]
    end

    MON --> ROOK
    ROOK --> CSI
    CSI --> SC
    SC --> PVC
    PVC --> APPS
    CSI --> MDS
    CSI --> OSD

Storage Architecture Decision

Why Rook Ceph?

The cluster migrated from OpenEBS Mayastor and various NFS backends to Rook Ceph for several key reasons:

  1. Unified Storage Platform: Single storage solution for all workload types
  2. External Cluster Design: Leverages existing Proxmox Ceph cluster infrastructure
  3. High Performance: Direct Ceph integration without NFS overhead
  4. Scalability: Native Ceph scalability for growing storage needs
  5. Feature Rich: Snapshots, cloning, expansion, and advanced storage features
  6. ReadWriteMany Support: CephFS provides shared filesystem access
  7. Production Proven: Mature, widely-adopted storage solution

Migration History

  • Previous: OpenEBS Mayastor (block storage) + Unraid NFS backends (shared storage)
  • Current: Rook Ceph with CephFS and RBD (unified storage platform)
  • In Progress: Decommissioning Unraid servers (tower/tower-2) in favor of Ceph

Current Storage Classes

CephFS Shared Storage (Default)

Storage Class: cephfs-shared

Primary storage class for all workloads requiring dynamic provisioning.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate

Characteristics:

  • Access Mode: ReadWriteMany (RWX) - Multiple pods can read/write simultaneously
  • Use Cases:
    • Applications requiring shared storage
    • Media applications
    • Backup repositories (VolSync)
    • Configuration storage
    • General application storage
  • Performance: Good performance for most workloads, shared filesystem overhead
  • Default: Yes - all PVCs without explicit storageClassName use this

CephFS Static Storage

Storage Class: cephfs-static

Used for pre-existing CephFS paths that need to be mounted into Kubernetes.

Characteristics:

  • Access Mode: ReadWriteMany (RWX)
  • Use Cases:
    • Mounting existing data directories (e.g., /truenas/* paths)
    • Large media libraries
    • Shared configuration repositories
    • Data migration scenarios
  • Provisioning: Manual - requires creating both PV and PVC
  • Pattern: See "Static PV Pattern" section below

Example: Media storage at /truenas/media

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media

RBD Block Storage

Storage Classes: ceph-rbd, ceph-bulk

High-performance block storage using Ceph RADOS Block Devices.

Characteristics:

  • Access Mode: ReadWriteOnce (RWO) - Single pod exclusive access
  • Performance: Superior to CephFS for block workloads (databases, etc.)
  • Thin Provisioning: Efficient storage allocation
  • Features: Snapshots, clones, fast resizing

Use Cases:

  • PostgreSQL and other databases
  • Stateful applications requiring block storage
  • Applications needing high IOPS
  • Workloads migrating from OpenEBS Mayastor

Storage Classes:

  • ceph-rbd: General-purpose RBD storage
  • ceph-bulk: Erasure-coded pool for large, less-critical data
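
For reference, an RBD storage class in a Rook external-cluster setup looks roughly like the sketch below; the pool name is a placeholder and the CSI secret parameters are omitted, so treat this as the general shape rather than the cluster's exact manifest:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: kubernetes                    # placeholder RBD pool name
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  # provisioner/node-stage secret parameters omitted for brevity
allowVolumeExpansion: true
reclaimPolicy: Delete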

Legacy Unraid NFS Storage (Being Decommissioned)

Storage Class: used-nfs (the static tower/tower-2 PVs themselves have no storage class)

Legacy NFS storage from Unraid servers, currently being migrated to Ceph.

Servers:

  • tower.manor - Primary Unraid server (100Ti NFS) - Decommissioning
  • tower-2.manor - Secondary Unraid server (100Ti NFS) - Decommissioning

Current Status:

  • Some media applications still use hybrid approach during migration
  • Active data migration to CephFS in progress
  • Will be fully retired once migration complete

Migration Plan: All workloads being moved to Ceph (CephFS or RBD as appropriate)

Storage Provisioning Patterns

Dynamic Provisioning (Default)

For most applications, simply create a PVC and Kubernetes will automatically provision storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  # No storageClassName specified = uses default (cephfs-shared)

Static PV Pattern

For mounting pre-existing CephFS paths:

Step 1: Create PersistentVolume

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-static-pv
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/my-data  # Pre-existing path in CephFS

Step 2: Create matching PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-static-pvc
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Ti
  storageClassName: cephfs-static
  volumeName: my-static-pv

Current Static PVs in Use:

  • media-cephfs-pv → /truenas/media (100Ti)
  • minio-cephfs-pv → /truenas/minio (10Ti)
  • paperless-cephfs-pv → /truenas/paperless (5Ti)

Storage Decision Matrix

| Workload Type | Storage Class | Access Mode | Rationale |
|---|---|---|---|
| Databases (PostgreSQL, etc.) | ceph-rbd | RWO | Best performance for block storage workloads |
| Media Libraries | cephfs-static or cephfs-shared | RWX | Shared access for media servers |
| Media Downloads | cephfs-shared | RWX | Multi-pod write access |
| Application Data (single pod) | ceph-rbd | RWO | High performance block storage |
| Application Data (multi-pod) | cephfs-shared | RWX | Concurrent access required |
| Backup Repositories | cephfs-shared | RWX | VolSync requires RWX |
| Shared Config | cephfs-shared | RWX | Multiple pods need access |
| Bulk Storage | ceph-bulk or cephfs-static | RWO/RWX | Large datasets, erasure coding |
| Legacy Apps (during migration) | used-nfs | RWX | Temporary until Unraid decommissioning is complete |

Backup Strategy

VolSync with CephFS

All persistent data is backed up using VolSync, which now uses CephFS for its repository storage:

  • Backup Frequency: Hourly snapshots via ReplicationSource
  • Repository Storage: CephFS PVC (migrated from NFS)
  • Backend: Restic repositories on CephFS
  • Retention: Configurable per-application
  • Recovery: Supports restore to same or different PVC

VolSync Repository Location: /repository/{APP} on CephFS
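
Restores are driven by a ReplicationDestination that points at the same Restic repository as the application's ReplicationSource; a hedged example (names and target PVC are placeholders):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore                 # placeholder
  namespace: my-namespace              # placeholder
spec:
  trigger:
    manual: restore-once
  restic:
    repository: my-app-restic-secret   # same secret as the ReplicationSource
    copyMethod: Direct
    destinationPVC: my-app-data        # restore straight into the app's PVC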

Network Configuration

Ceph Networks

The external Ceph cluster uses two networks:

  • Public Network: 10.150.0.0/24
    • Client connections from Kubernetes
    • Ceph monitor communication
    • Used by CSI drivers
  • Cluster Network: 10.200.0.0/24
    • OSD-to-OSD replication
    • Not directly accessed by Kubernetes

Connection Method

Kubernetes connects to Ceph via:

  1. Rook Operator: Manages connection to external cluster
  2. CSI Drivers: cephfs.csi.ceph.com for CephFS volumes
  3. Mon Endpoints: ConfigMap with Ceph monitor addresses
  4. Authentication: Ceph client.kubernetes credentials

Performance Characteristics

CephFS Performance

  • Sequential Read: Excellent (limited by network, ~10 Gbps)
  • Sequential Write: Very Good (COW overhead, CRUSH rebalancing)
  • Random I/O: Good (shared filesystem overhead)
  • Concurrent Access: Excellent (native RWX support)
  • Metadata Operations: Good (dedicated MDS servers)

Optimization Tips

  1. Use RWO when possible: Even on CephFS, specify RWO if no sharing needed
  2. Size appropriately: CephFS handles small and large files well
  3. Monitor MDS health: CephFS performance depends on MDS responsiveness
  4. Enable client caching: Default CSI settings enable attribute caching

Storage Operations

Common Operations

Expand a PVC:

kubectl patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

Check storage usage:

kubectl get pvc -A
kubectl exec -it <pod> -- df -h

Monitor Ceph cluster health:

kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

List CephFS mounts:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

Troubleshooting

PVC stuck in Pending:

kubectl describe pvc <pvc-name>
kubectl -n rook-ceph logs -l app=rook-ceph-operator

Slow performance:

# Check Ceph cluster health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail

# Check MDS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

Mount issues:

# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin

# Verify connection to monitors
kubectl -n rook-ceph get configmap rook-ceph-mon-endpoints -o yaml

Current Migration Status

Completed

  • ✅ RBD storage classes implemented and available
  • ✅ CephFS as default storage class
  • ✅ VolSync migrated to CephFS backend
  • ✅ Static PV pattern established for existing data
  • ✅ Migrated from OpenEBS Mayastor to Ceph RBD

In Progress

  • 🔄 Decommissioning Unraid NFS servers (tower/tower-2)
  • 🔄 Migrating remaining media workloads from NFS to CephFS
  • 🔄 Consolidating all storage onto Ceph platform

Future Enhancements

  • 📋 Additional RBD pool with SSD backing for critical workloads
  • 📋 Erasure coding optimization for bulk media storage
  • 📋 Advanced snapshot scheduling and retention policies
  • 📋 Ceph performance tuning and optimization

Best Practices

Storage Selection

  1. Databases and single-pod apps: Use ceph-rbd for best performance
  2. Shared storage needs: Use cephfs-shared for RWX access
  3. Use static PVs for existing data: Don't duplicate large datasets
  4. Specify requests accurately: Helps with capacity planning
  5. Choose appropriate access modes: RWO for RBD, RWX for CephFS

Capacity Planning

  1. Monitor Ceph cluster capacity:
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
    
  2. Set appropriate PVC sizes: CephFS supports expansion
  3. Plan for growth: Ceph cluster can scale by adding OSDs
  4. Regular capacity reviews: Check usage trends

Data Protection

  1. Enable VolSync: For all stateful applications
  2. Test restores regularly: Ensure backup viability
  3. Monitor backup success: Check ReplicationSource status
  4. Retain snapshots appropriately: Balance storage cost vs recovery needs

Security

  1. Use namespace isolation: PVCs are namespace-scoped
  2. Limit access with RBAC: Control who can create PVCs
  3. Monitor access patterns: Unusual I/O may indicate issues
  4. Rotate Ceph credentials: Periodically update client keys

Monitoring and Observability

Key Metrics

Monitor these metrics via Prometheus/Grafana:

  • Ceph cluster health status
  • OSD utilization and performance
  • MDS cache hit rates
  • PVC capacity usage
  • CSI operation latencies
  • VolSync backup success rates

Alerts

Critical alerts configured:

  • Ceph cluster health warnings
  • High OSD utilization (>80%)
  • MDS performance degradation
  • PVC approaching capacity
  • VolSync backup failures

References

  • Rook Documentation: rook.io/docs
  • Ceph Documentation: docs.ceph.com
  • Local Setup: kubernetes/apps/rook-ceph/README.md
  • Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml

Media Applications

Media Stack Overview

The media stack provides automated media management and streaming services using the *arr suite of applications and Plex Media Server.

graph TD
    subgraph Content Acquisition
        SONARR[Sonarr - TV Shows]
        SONARR_UHD[Sonarr UHD - 4K TV]
        RADARR[Radarr - Movies]
        RADARR_UHD[Radarr UHD - 4K Movies]
        BAZARR[Bazarr - Subtitles]
        BAZARR_UHD[Bazarr UHD - 4K Subtitles]
    end

    subgraph Download Clients
        SABNZBD[SABnzbd - Usenet]
        NZBGET[NZBget - Usenet Alt]
    end

    subgraph Media Storage
        CEPHFS[CephFS Static PV<br>/truenas/media]
        TOWER[Tower NFS<br>/mnt/user]
        TOWER2[Tower-2 NFS<br>/mnt/user]
    end

    subgraph Media Server
        PLEX[Plex Media Server]
        TAUTULLI[Tautulli - Analytics]
        OVERSEERR[Overseerr - Requests]
    end

    subgraph Post Processing
        TDARR[Tdarr - Transcoding]
        KOMETA[Kometa - Metadata]
    end

    SONARR --> SABNZBD
    RADARR --> SABNZBD
    SABNZBD --> CEPHFS
    SABNZBD --> TOWER
    SABNZBD --> TOWER2
    CEPHFS --> PLEX
    TOWER --> PLEX
    TOWER2 --> PLEX
    PLEX --> TAUTULLI
    OVERSEERR --> SONARR
    OVERSEERR --> RADARR
    TDARR --> CEPHFS
    KOMETA --> PLEX

Storage Configuration

Primary Media Library - CephFS

The main media library is stored on CephFS using a static PV that mounts the pre-existing /truenas/media directory.

Configuration: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media

Mount Pattern: Applications mount this PVC at /media or specific subdirectories:

  • /media/downloads - Download staging area
  • /media/tv - TV show library
  • /media/movies - Movie library
  • /media/music - Music library
  • /media/books - Book library

Benefits:

  • ReadWriteMany: Multiple pods can access simultaneously
  • High Performance: Direct CephFS access, no NFS overhead
  • Shared Access: All media apps see the same filesystem
  • Snapshots: VolSync backups protect the data

Legacy NFS Mounts (Unraid)

Download clients and some media applications use legacy NFS mounts from Unraid servers alongside CephFS.

Servers:

  • tower.manor - Primary Unraid server (100Ti NFS)
  • tower-2.manor - Secondary Unraid server (100Ti NFS)

Current Usage:

  • SABnzbd downloads to all three storage backends (CephFS, tower, tower-2)
  • Plex reads media from all three storage backends
  • Active downloads and in-progress media on Unraid
  • Organized/completed media on CephFS

Status: Legacy - gradual migration to CephFS in progress

Configuration: Static PVs without storage class

apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-tower-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: tower.manor
    path: /mnt/user
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-tower-2-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: tower-2.manor
    path: /mnt/user

Core Components

Media Server

Plex Media Server

Namespace: media
Purpose: Media streaming and library management

Plex is the primary media server, providing:

  • Streaming to multiple devices
  • Hardware transcoding (Intel Quick Sync)
  • Library organization and metadata
  • User management and sharing
  • Remote access

Configuration: kubernetes/apps/media/plex/app/helmrelease.yaml

Storage Mounts (Multi-backend):

  • Media library (CephFS): CephFS static PV at /media
  • Media library (Tower): Tower NFS at /tower
  • Media library (Tower-2): Tower-2 NFS at /tower-2
  • Configuration: CephFS dynamic PVC (10Gi)
  • Transcoding cache: EmptyDir (temporary)

Library Configuration:

  • Plex libraries configured to scan all three storage backends
  • Unified library view across CephFS and Unraid storage

Resource Allocation:

resources:
  requests:
    cpu: 2000m
    memory: 4Gi
    gpu.intel.com/i915: 1
  limits:
    cpu: 8000m
    memory: 16Gi
    gpu.intel.com/i915: 1

Hardware Acceleration: Intel Quick Sync enabled for transcoding

Tautulli

Namespace: media
Purpose: Plex analytics and monitoring

Provides:

  • Watch history and statistics
  • User activity monitoring
  • Notification triggers
  • API for automation

Storage: CephFS dynamic PVC (5Gi) for database and logs

Content Acquisition (*arr Suite)

Sonarr / Sonarr UHD

Purpose: TV show automation

  • Sonarr: Standard quality TV shows
  • Sonarr UHD: 4K/UHD TV shows

Features:

  • TV series tracking and monitoring
  • Episode search and download
  • Quality profiles and upgrades
  • Calendar and schedule tracking

Storage:

  • Configuration: CephFS dynamic PVC (10Gi)
  • Media access: CephFS static PV (shared /media)

Radarr / Radarr UHD

Purpose: Movie automation

  • Radarr: Standard quality movies
  • Radarr UHD: 4K/UHD movies

Features:

  • Movie library management
  • Automated downloads
  • Quality management
  • List integration (IMDb, Trakt)

Storage:

  • Configuration: CephFS dynamic PVC (10Gi)
  • Media access: CephFS static PV (shared /media)

Bazarr / Bazarr UHD

Purpose: Subtitle management

Automated subtitle downloading for:

  • TV shows (via Sonarr integration)
  • Movies (via Radarr integration)
  • Multiple languages
  • Subtitle providers

Storage: CephFS dynamic PVC (5Gi)

Download Clients

SABnzbd

Namespace: media
Purpose: Primary Usenet download client

Features:

  • NZB file processing
  • Automated post-processing
  • Category-based handling
  • Integration with *arr apps

Storage Mounts (Multi-backend):

  • Configuration: CephFS dynamic PVC (5Gi)
  • Downloads (CephFS): CephFS static PV /media/downloads/usenet
  • Downloads (Tower): Tower NFS /tower/downloads/usenet
  • Downloads (Tower-2): Tower-2 NFS /tower-2/downloads/usenet
  • Incomplete: CephFS dynamic PVC (temporary downloads)

Download Strategy:

  • Categories route to different storage backends
  • Active downloads use appropriate backend based on category
  • Completed downloads moved to final library location

Post-Processing: Automatically moves completed downloads to appropriate media folders

NZBget

Namespace: media
Purpose: Alternative Usenet client

Lightweight alternative to SABnzbd for specific use cases.

Storage: Similar pattern to SABnzbd

Post-Processing

Tdarr

Purpose: Media transcoding and file optimization

Components:

  1. Tdarr Server: Manages transcoding queue
  2. Tdarr Node: CPU-based transcoding workers
  3. Tdarr Node GPU: GPU-accelerated transcoding

Use Cases:

  • Convert media to h265/HEVC
  • Reduce file sizes
  • Standardize formats
  • Remove unwanted audio/subtitle tracks

Storage:

  • Configuration: CephFS dynamic PVC (25Gi)
  • Media access: CephFS static PV (shared /media)
  • Transcode cache: CephFS dynamic PVC (100Gi)

Resource Intensive: Uses significant CPU/GPU resources during transcoding

Kometa (formerly Plex Meta Manager)

Purpose: Enhanced Plex metadata and collections

Features:

  • Automated collections (e.g., "Top Rated 2023")
  • Poster and artwork management
  • Rating and tag synchronization
  • Scheduled metadata updates

Storage: CephFS dynamic PVC (5Gi) for configuration

User Management

Overseerr

Namespace: media
Purpose: Media request management

User-facing application for:

  • Media requests (movies/TV shows)
  • Request approval workflow
  • User quotas and limits
  • Integration with Sonarr/Radarr

Authentication: Integrated with Plex accounts

Storage: CephFS dynamic PVC (5Gi)

Network Configuration

Internal Access

All media applications are accessible via internal DNS:

spec:
  ingressClassName: internal
  hosts:
    - host: plex.chelonianlabs.com
      paths:
        - path: /
          pathType: Prefix

External Access

Plex is accessible externally via:

  • Cloudflared tunnel for secure access
  • Direct access on port 32400 (firewall controlled)

Service Discovery

Applications discover each other via Kubernetes services:

  • sonarr.media.svc.cluster.local:8989
  • radarr.media.svc.cluster.local:7878
  • sabnzbd.media.svc.cluster.local:8080
  • plex.media.svc.cluster.local:32400

Backup Strategy

Application Configuration

All *arr application configurations are backed up via VolSync:

Backup Schedule: Hourly
Retention:

  • Hourly: 24 snapshots
  • Daily: 7 snapshots
  • Weekly: 4 snapshots

Backup Pattern:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: sonarr
  namespace: media
spec:
  sourcePVC: sonarr-config
  trigger:
    schedule: "0 * * * *"
  restic:
    repository: sonarr-restic-secret
    retain:
      hourly: 24
      daily: 7
      weekly: 4

Media Library

Media files are NOT backed up via VolSync due to size (100Ti+)

Protection Strategy:

  • Ceph replication (3x copies across OSDs)
  • Replaceable content (can be re-downloaded)
  • Critical media manually backed up externally

Configuration Backup: All *arr databases and settings are backed up

Resource Management

Resource Allocation Strategy

Media applications have varying resource needs:

High Resource:

  • Plex: 2-8 CPU, 4-16Gi RAM, GPU for transcoding
  • Tdarr: 4-16 CPU, 8-32Gi RAM, GPU optional

Medium Resource:

  • Sonarr/Radarr: 500m-2 CPU, 512Mi-2Gi RAM
  • SABnzbd: 1-4 CPU, 1-4Gi RAM

Low Resource:

  • Bazarr: 100m-500m CPU, 128Mi-512Mi RAM
  • Overseerr: 100m-500m CPU, 256Mi-1Gi RAM

Storage Quotas

Dynamic PVCs sized appropriately:

  • Configuration: 5-10Gi (databases, logs)
  • Download buffers: 100Gi (temporary downloads)
  • Transcode cache: 100Gi (Tdarr working space)

Maintenance

Regular Tasks

Weekly:

  • Review failed downloads
  • Check disk space usage
  • Verify backup completion
  • Update metadata (Kometa)

Monthly:

  • Library maintenance (Plex)
  • Database optimization (*arr apps)
  • Review and cleanup old downloads
  • Check for application updates (Renovate handles this)

Health Monitoring

Key Metrics:

  • Plex stream count and transcoding sessions
  • SABnzbd download queue and speed
  • *arr indexer health and search failures
  • Storage capacity and growth rate

Alerts:

  • Download failures
  • Indexer connectivity issues
  • Storage capacity warnings
  • Failed backup jobs

Troubleshooting

Common Issues

Plex can't see media files:

# Check PVC mount
kubectl exec -n media deployment/plex -- ls -la /media

# Verify permissions
kubectl exec -n media deployment/plex -- ls -ld /media/movies /media/tv

# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Downloads not moving to library:

# Check SABnzbd logs
kubectl logs -n media deployment/sabnzbd --tail=100

# Verify shared storage access
kubectl exec -n media deployment/sabnzbd -- ls -la /media/downloads/usenet

# Check Sonarr/Radarr import
kubectl logs -n media deployment/sonarr --tail=100 | grep -i import

Slow transcoding:

# Verify GPU allocation
kubectl describe pod -n media -l app.kubernetes.io/name=plex | grep -A5 "Limits\|Requests"

# Check GPU utilization (on node)
intel_gpu_top

# Review transcode logs
kubectl logs -n media deployment/plex | grep -i transcode

Storage full:

# Check PVC usage
kubectl get pvc -n media

# Check storage usage in pod
kubectl exec -n media deployment/plex -- df -h | grep media

# Check Ceph cluster capacity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

Best Practices

Storage Organization

Directory Structure:

/media/
├── downloads/
│   ├── usenet/          # SABnzbd downloads
│   └── complete/        # Completed downloads
├── movies/              # Radarr managed
│   ├── 4k/             # UHD content
│   └── 1080p/          # HD content
├── tv/                  # Sonarr managed
│   ├── 4k/
│   └── 1080p/
├── music/
└── books/

Quality Profiles

  • Use separate instances for 4K content (Sonarr UHD, Radarr UHD)
  • Configure appropriate quality cutoffs
  • Enable upgrades for better releases
  • Set size limits to prevent excessive downloads

Download Management

  • Configure category-based post-processing in SABnzbd
  • Use download client categories in *arr apps
  • Enable completed download handling
  • Set appropriate retention for download history

Performance Optimization

  • Use hardware transcoding (Intel Quick Sync)
  • Pre-optimize media with Tdarr (h265/HEVC)
  • Adjust Plex transcoder quality settings
  • Enable Plex optimize versions for common devices

Security Considerations

Access Control

  • Internal Network Only: Media apps exposed only via internal ingress
  • Authentication Required: All apps require login
  • Plex Managed Auth: User access controlled via Plex sharing
  • Overseerr Integration: Request permissions via Plex accounts

API Keys

  • All API keys stored in Kubernetes secrets
  • External Secrets integration with Infisical
  • Regular key rotation via automation
  • Least privilege access between services

Future Improvements

Planned Enhancements

  • GPU Transcoding Pool: Dedicated GPU nodes for Tdarr
  • Request Automation: Auto-approve for trusted users
  • Advanced Monitoring: Grafana dashboards for media metrics
  • Content Analysis: Automated duplicate detection
  • Unraid Migration: Gradual migration of tower/tower-2 NFS storage to CephFS
    • Currently using hybrid approach (CephFS + tower + tower-2)
    • Plan: Consolidate all media storage to CephFS
    • Timeline: When Unraid servers are decommissioned

Under Consideration

  • Jellyfin: Alternative media server for comparison
  • Prowlarr: Unified indexer management
  • Readarr: Book management automation
  • Lidarr: Music management automation

References

  • Media Storage: kubernetes/apps/media/storage/
  • Plex: kubernetes/apps/media/plex/
  • Sonarr: kubernetes/apps/media/sonarr/
  • Radarr: kubernetes/apps/media/radarr/
  • SABnzbd: kubernetes/apps/media/sabnzbd/
  • Storage Architecture: docs/src/architecture/storage.md

Storage Applications

This document covers the storage-related applications and services running in the cluster.

Storage Stack Overview

graph TD
    subgraph External Infrastructure
        CEPH[Proxmox Ceph Cluster]
    end

    subgraph Kubernetes Storage Layer
        ROOK[Rook Ceph Operator]
        CSI[Ceph CSI Drivers]
        VOLSYNC[VolSync]
    end

    subgraph Storage Consumers
        APPS[Applications]
        BACKUPS[Backup Repositories]
    end

    CEPH --> ROOK
    ROOK --> CSI
    CSI --> APPS
    VOLSYNC --> BACKUPS
    CSI --> VOLSYNC

Core Components

Rook Ceph Operator

Namespace: rook-ceph
Type: Helm Release
Purpose: Manages connection to external Ceph cluster and provides CSI drivers

The Rook operator is the bridge between Kubernetes and the external Ceph cluster. It:

  • Manages CSI driver deployments
  • Maintains connection to Ceph monitors
  • Handles authentication and secrets
  • Provides CephFS filesystem access

Configuration: kubernetes/apps/rook-ceph/rook-ceph-operator/app/helmrelease.yaml

Current Setup:

  • CephFS Driver: Enabled ✅
  • RBD Driver: Disabled (Phase 2)
  • Connection Mode: External cluster
  • Network: Public network 10.150.0.0/24

Key Resources:

# Check operator status
kubectl -n rook-ceph get pods -l app=rook-ceph-operator

# View operator logs
kubectl -n rook-ceph logs -l app=rook-ceph-operator -f

# Check CephCluster resource
kubectl -n rook-ceph get cephcluster

Rook Ceph Cluster Configuration

Namespace: rook-ceph
Type: CephCluster Custom Resource
Purpose: Defines external Ceph cluster connection

Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/cluster-external.yaml

This resource tells Rook how to connect to the external Ceph cluster:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  external:
    enable: true
  dataDirHostPath: /var/lib/rook
  cephVersion:
    image: quay.io/ceph/ceph:v18

Monitor Configuration: Defined in ConfigMap rook-ceph-mon-endpoints

  • Contains Ceph monitor IP addresses
  • Critical for cluster connectivity
  • Automatically referenced by CSI drivers

Authentication: Stored in Secret rook-ceph-mon

  • Contains client.kubernetes Ceph credentials
  • Encrypted with SOPS
  • Referenced by all CSI operations
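
For orientation, the mon-endpoints ConfigMap produced by the external-cluster import has roughly this shape (monitor names and exact keys can vary by Rook version; the addresses follow the Ceph public network):

apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-mon-endpoints
  namespace: rook-ceph
data:
  data: "a=10.150.0.4:6789,b=10.150.0.2:6789"   # monitor name=address pairs
  mapping: "{}"
  maxMonId: "2"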

Ceph CSI Drivers

Namespace: rook-ceph
Type: DaemonSet (nodes) + Deployment (provisioner)
Purpose: Enable Kubernetes to mount CephFS volumes

Components:

  1. csi-cephfsplugin (DaemonSet)

    • Runs on every node
    • Mounts CephFS volumes to pods
    • Handles node-level operations
  2. csi-cephfsplugin-provisioner (Deployment)

    • Creates/deletes CephFS subvolumes
    • Handles dynamic provisioning
    • Manages volume expansion

Monitoring:

# Check CSI pods
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin

# View CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin

# Check provisioner
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin-provisioner

Storage Classes

Configuration: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml

cephfs-shared (Default)

Primary storage class for all dynamic provisioning:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: cephfs
  pool: cephfs_data
allowVolumeExpansion: true
reclaimPolicy: Delete

Usage: Default for all PVCs without explicit storageClassName
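
For illustration, a minimal PVC that omits storageClassName and therefore lands on cephfs-shared (claim name and namespace are hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data        # hypothetical claim name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # storageClassName omitted, so the default class (cephfs-shared) is used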

cephfs-static

For mounting pre-existing CephFS directories:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-static
provisioner: rook-ceph.cephfs.csi.ceph.com
# Used with manually created PVs pointing to existing paths

Usage: Requires manual PV creation, see examples below

VolSync

Namespace: storage Type: Helm Release Purpose: Backup and recovery for Persistent Volume Claims

VolSync provides automated backup of all stateful applications using Restic.

Configuration: kubernetes/apps/storage/volsync/app/helmrelease.yaml

Backup Repository: CephFS-backed PVC

  • Location: volsync-cephfs-pvc (5Ti)
  • Path: /repository/{APP}/ for each application
  • Previous: NFS on vault.manor (migrated to CephFS)

How It Works:

  1. Applications create ReplicationSource resources
  2. VolSync creates backup pods with mover containers
  3. Mover mounts both application PVC and repository PVC
  4. Restic backs up data to repository
  5. Retention policies keep configured snapshot count

Backup Pattern:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-app
  namespace: my-namespace
spec:
  sourcePVC: my-app-data
  trigger:
    schedule: "0 * * * *"  # Hourly
  restic:
    repository: my-app-restic-secret
    retain:
      hourly: 24
      daily: 7
      weekly: 4

Common Operations:

# Manual backup trigger
task volsync:snapshot NS=<namespace> APP=<app>

# List snapshots
task volsync:run NS=<namespace> REPO=<app> -- snapshots

# Unlock repository (if locked)
task volsync:unlock-local NS=<namespace> APP=<app>

# Restore to new PVC
task volsync:restore NS=<namespace> APP=<app>

Repository PVC Configuration: kubernetes/apps/storage/volsync/app/volsync-cephfs-pv.yaml

Static PV Examples

Media Storage

Large media library mounted from pre-existing CephFS path:

Location: kubernetes/apps/media/storage/app/media-cephfs-pv.yaml

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media-cephfs-pv
spec:
  capacity:
    storage: 100Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/media
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-cephfs-pvc
  namespace: media
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: cephfs-static
  volumeName: media-cephfs-pv

Usage: Mounted by Plex, Sonarr, Radarr, etc. for media library access
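
As a rough sketch (container name and mount path are illustrative, not copied from the actual app manifests), a workload mounts the claim like any other PVC:

# Illustrative pod template fragment - real manifests live under kubernetes/apps/media/
spec:
  template:
    spec:
      containers:
        - name: plex                # example container
          volumeMounts:
            - name: media
              mountPath: /media     # example mount path
      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: media-cephfs-pvc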

Minio Object Storage

Minio data stored on CephFS:

Location: kubernetes/apps/storage/minio/app/minio-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: minio-cephfs-pv
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/minio

Paperless-ngx Document Storage

Document management system storage:

Location: kubernetes/apps/selfhosted/paperless-ngx/app/paperless-cephfs-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: paperless-cephfs-pv
spec:
  capacity:
    storage: 5Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/paperless

Storage Operations

Creating a New Static PV

Step 1: Create directory in CephFS (on Proxmox Ceph node)

# SSH to a Proxmox node with Ceph access
mkdir -p /mnt/cephfs/truenas/my-app
chmod 777 /mnt/cephfs/truenas/my-app  # Or appropriate permissions

Step 2: Create PV manifest

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-app-cephfs-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: cephfs-static
  csi:
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-static
      namespace: rook-ceph
    volumeAttributes:
      clusterID: rook-ceph
      fsName: cephfs
      staticVolume: "true"
      rootPath: /truenas/my-app

Step 3: Create PVC manifest

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-cephfs-pvc
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: cephfs-static
  volumeName: my-app-cephfs-pv

Step 4: Apply and verify

kubectl apply -f pv.yaml
kubectl apply -f pvc.yaml
kubectl get pv my-app-cephfs-pv
kubectl get pvc -n my-namespace my-app-cephfs-pvc

Expanding a PVC

CephFS supports online volume expansion:

# Edit PVC to increase size
kubectl patch pvc my-pvc -n my-namespace -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'

# Verify expansion
kubectl get pvc -n my-namespace my-pvc -w

Note: Size can only increase, not decrease

Troubleshooting Mount Issues

PVC stuck in Pending:

# Check PVC events
kubectl describe pvc -n <namespace> <pvc-name>

# Check CSI driver logs
kubectl -n rook-ceph logs -l app=csi-cephfsplugin -c csi-cephfsplugin --tail=100

# Verify storage class exists
kubectl get sc cephfs-shared

Pod can't mount volume:

# Check pod events
kubectl describe pod -n <namespace> <pod-name>

# Verify Ceph cluster connectivity
kubectl -n rook-ceph get cephcluster

# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Verify CephFS is available
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

Slow I/O performance:

# Check MDS performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

# Identify slow operations
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail

Monitoring and Alerts

Key Metrics

Monitor these via Prometheus/Grafana:

  1. Storage Capacity

    • Ceph cluster utilization
    • Individual PVC usage (example query below)
    • Growth trends
  2. Performance

    • CSI operation latency
    • MDS cache hit ratio
    • OSD I/O rates
  3. Reliability

    • VolSync backup success rate
    • Ceph health status
    • CSI driver availability
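
For example, individual PVC usage can be graphed from the kubelet volume stats, assuming the standard kubelet metrics are scraped by Prometheus:

# PVC fill ratio by namespace and claim
max by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes
)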

Useful Queries

Check all PVCs by size:

kubectl get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,SIZE:.spec.resources.requests.storage,STORAGECLASS:.spec.storageClassName --sort-by=.spec.resources.requests.storage

Find PVCs using old storage classes:

kubectl get pvc -A -o json | jq -r '.items[] | select(.spec.storageClassName == "nfs-csi" or .spec.storageClassName == "mayastor-etcd-localpv") | "\(.metadata.namespace)/\(.metadata.name) - \(.spec.storageClassName)"'

Check Ceph cluster capacity:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

Monitor VolSync backups:

# Check all ReplicationSources
kubectl get replicationsource -A

# Check specific backup status
kubectl get replicationsource -n <namespace> <app> -o jsonpath='{.status.lastSyncTime}'

Backup and Recovery

VolSync Backup Workflow

  1. Application creates ReplicationSource
  2. VolSync creates backup pod (every hour by default)
  3. Restic backs up PVC to repository
  4. Snapshots retained per retention policy
  5. Status updated in ReplicationSource

Restore Procedures

Restore to original PVC:

# Scale down application
kubectl scale deployment -n <namespace> <app> --replicas=0

# Run restore
task volsync:restore NS=<namespace> APP=<app>

# Scale up application
kubectl scale deployment -n <namespace> <app> --replicas=1

Restore to new PVC:

  1. Create a ReplicationDestination pointing to the new PVC (see the sketch below)
  2. VolSync will restore data from repository
  3. Update application to use new PVC
  4. Verify data integrity
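
A minimal ReplicationDestination sketch for step 1, assuming the target PVC has been created beforehand and reusing this document's <namespace>/<app> placeholders:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore              # hypothetical name
  namespace: my-namespace
spec:
  trigger:
    manual: restore-once            # run a single restore
  restic:
    repository: my-app-restic-secret
    destinationPVC: my-app-data-new # pre-created PVC to restore into
    copyMethod: Direct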

Disaster Recovery

Complete cluster rebuild:

  1. Deploy new Kubernetes cluster
  2. Install Rook with same external Ceph connection
  3. Recreate storage classes
  4. Deploy VolSync
  5. Restore all applications from backups

CephFS corruption:

  1. Check Ceph health and repair if possible
  2. If unrecoverable, restore from VolSync backups
  3. VolSync repository is on CephFS, so ensure repository is intact
  4. Consider external backup of VolSync repository

Security Considerations

Ceph Authentication

  • Client Key: client.kubernetes Ceph user
  • Permissions: Limited to CephFS pools only
  • Storage: SOPS-encrypted in rook-ceph-mon secret
  • Rotation: Should be rotated periodically

PVC Access Control

  • Namespace Isolation: PVCs are namespace-scoped
  • RBAC: Control who can create/delete PVCs
  • Pod Security: Pods must have appropriate security context
  • Network Policies: Limit which pods can access storage

Backup Security

  • VolSync Repository: Protected by Kubernetes RBAC
  • Restic Encryption: Repository encryption with per-app keys
  • Snapshot Access: Controlled via ReplicationSource ownership

Future Enhancements (Phase 2)

RBD Block Storage

When Mayastor hardware is repurposed:

  1. Enable RBD driver in Rook operator
  2. Create RBD pools on Ceph cluster:
    • ssd-db - Critical workloads
    • rook-pvc-pool - General purpose
    • media-bulk - Erasure-coded bulk storage
  3. Deploy RBD storage classes (sketch below)
  4. Migrate workloads based on performance requirements
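
A rough sketch of what an RBD storage class could look like once Phase 2 is underway (the class name is hypothetical, and the CSI secret parameters that Rook normally templates in are omitted for brevity):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block                  # hypothetical name
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: rook-pvc-pool               # general-purpose pool from the plan above
  imageFormat: "2"
  imageFeatures: layering
allowVolumeExpansion: true
reclaimPolicy: Delete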

Planned Improvements

  • Ceph dashboard integration
  • Advanced monitoring dashboards
  • Automated capacity alerts
  • Storage QoS policies
  • Cross-cluster replication

References

  • Rook Operator: kubernetes/apps/rook-ceph/rook-ceph-operator/
  • Cluster Config: kubernetes/apps/rook-ceph/rook-ceph-cluster/
  • Storage Classes: kubernetes/apps/rook-ceph/rook-ceph-cluster/app/storageclasses.yaml
  • VolSync: kubernetes/apps/storage/volsync/
  • Architecture: docs/src/architecture/storage.md

Observability

Installation Guide

Prerequisites

graph TD
    subgraph Hardware
        CP[Control Plane Nodes]
        GPU[GPU Worker Node]
        Worker[Worker Nodes]
    end

    subgraph Software
        OS[Operating System]
        Tools[Required Tools]
        Network[Network Setup]
    end

    subgraph Configuration
        Git[Git Repository]
        Secrets[SOPS Setup]
        Certs[Certificates]
    end

Hardware Requirements

Control Plane Nodes (3x)

  • CPU: 4 cores per node
  • RAM: 16GB per node
  • Role: Cluster control plane

GPU Worker Node (1x)

  • CPU: 16 cores
  • RAM: 128GB
  • GPU: 4x NVIDIA Tesla P100
  • Role: GPU-accelerated workloads

Worker Nodes (2x)

  • CPU: 16 cores per node
  • RAM: 128GB per node
  • Role: General workloads

Software Prerequisites

  1. Operating System

    • Linux distribution
    • Updated system packages
    • Required kernel modules
    • NVIDIA drivers (for GPU node)
  2. Required Tools

    • kubectl
    • flux
    • SOPS
    • age/gpg
    • task

Initial Setup

1. Repository Setup

# Clone the repository
git clone https://github.com/username/dapper-cluster.git
cd dapper-cluster

# Create configuration
cp config.sample.yaml config.yaml

2. Configuration

graph LR
    Config[Configuration] --> Secrets[Secrets Management]
    Config --> Network[Network Settings]
    Config --> Storage[Storage Setup]
    Secrets --> SOPS[SOPS Encryption]
    Network --> DNS[DNS Setup]
    Storage --> CSI[CSI Drivers]

Edit Configuration

cluster:
  name: dapper-cluster
  domain: example.com

network:
  cidr: 10.0.0.0/16

storage:
  ceph:
    # External Ceph cluster connection
    # Configured via Rook operator after bootstrap
    monitors: []  # Set during Rook deployment

3. Secrets Management

  • Generate age key
  • Configure SOPS
  • Encrypt sensitive files
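
A minimal sketch of those steps, assuming age is used for encryption (the manifest path is illustrative):

# Generate an age key pair
age-keygen -o age.key

# Add the public key to .sops.yaml, then encrypt a manifest in place
sops --encrypt --in-place kubernetes/apps/example/app/secret.sops.yaml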

4. Bootstrap Process

graph TD
    Start[Start Installation] --> CP[Bootstrap Control Plane]
    CP --> Workers[Join Worker Nodes]
    Workers --> GPU[Configure GPU Node]
    GPU --> Flux[Install Flux]
    Flux --> Apps[Deploy Apps]

Bootstrap Commands

# Initialize flux
task flux:bootstrap

# Verify installation
task cluster:verify

# Verify GPU support
kubectl get nodes -o wide
nvidia-smi # on GPU node

Post-Installation

1. Verify Components

  • Check control plane health
  • Verify worker node status
  • Test GPU functionality
  • Check Rook Ceph connection
  • Verify storage classes
  • Verify network connectivity
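
These checks map to familiar commands (a non-exhaustive sketch):

# Nodes and control plane
kubectl get nodes -o wide

# Flux and GitOps reconciliation
flux check

# Rook/Ceph connection and storage classes
kubectl -n rook-ceph get cephcluster
kubectl get storageclass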

2. Deploy Applications

  • Deploy core services
  • Configure monitoring
  • Setup backup systems
  • Deploy GPU-enabled workloads

3. Security Setup

  • Configure network policies
  • Setup certificate management
  • Enable monitoring and alerts
  • Secure GPU access

Troubleshooting

Common installation issues and solutions:

  1. Control Plane Issues

    • Verify etcd cluster health
    • Check control plane components
    • Review system logs
  2. Worker Node Issues

    • Verify node join process
    • Check kubelet status
    • Review node logs
  3. GPU Node Issues

    • Verify NVIDIA driver installation
    • Check NVIDIA container runtime
    • Validate GPU visibility in cluster
  4. Storage Issues

    • Verify Ceph cluster connectivity
    • Check Rook operator status
    • Verify storage class configuration
    • Review CephCluster resource health
    • Check PV/PVC status
  5. Network Problems

    • Check DNS resolution
    • Verify network policies
    • Review ingress configuration

Maintenance

Regular Tasks

  1. System updates
  2. Certificate renewal
  3. Backup verification
  4. Security audits
  5. GPU driver updates

Health Checks

  • Component status
  • Resource usage
  • Storage capacity
  • Network connectivity
  • GPU health

Next Steps

After successful installation:

  1. Review Architecture Overview
  2. Configure Storage
  3. Setup Network
  4. Deploy Applications

Maintenance Guide

Maintenance Overview

graph TD
    subgraph Regular Tasks
        Updates[System Updates]
        Backups[Backup Tasks]
        Monitoring[Health Checks]
    end

    subgraph Periodic Tasks
        Audit[Security Audits]
        Cleanup[Resource Cleanup]
        Review[Config Review]
    end

    Updates --> Verify[Verification]
    Backups --> Test[Backup Testing]
    Monitoring --> Alert[Alert Response]

Regular Maintenance

Daily Tasks

  1. Monitor system health
    • Check cluster status
    • Review resource usage
    • Verify backup completion
    • Check alert status

Weekly Tasks

  1. Review system logs
  2. Check storage usage
  3. Verify backup integrity
  4. Update documentation

Monthly Tasks

  1. Security updates
  2. Certificate rotation
  3. Resource optimization
  4. Performance review

Update Procedures

Flux Updates

graph LR
    PR[Pull Request] --> Review[Review Changes]
    Review --> Test[Test Environment]
    Test --> Deploy[Deploy to Prod]
    Deploy --> Monitor[Monitor Status]

Application Updates

  1. Review release notes
  2. Test in staging if available
  3. Update flux manifests
  4. Monitor deployment
  5. Verify functionality

Backup Management

Backup Strategy

graph TD
    Apps[Applications] --> Data[Data Backup]
    Config[Configurations] --> Git[Git Repository]
    Secrets[Secrets] --> Vault[Secret Storage]

    Data --> Verify[Verification]
    Git --> Verify
    Vault --> Verify

Backup Verification

  • Regular restore testing
  • Data integrity checks
  • Recovery time objectives
  • Backup retention policy

Resource Management

Cleanup Procedures

  1. Remove unused resources

    • Orphaned PVCs
    • Completed jobs
    • Old backups
    • Unused configs
  2. Storage optimization

    • Compress old logs
    • Archive unused data
    • Clean container cache

Monitoring and Alerts

Key Metrics

  • Node health
  • Pod status
  • Resource usage
  • Storage capacity
  • Network performance

Alert Response

  1. Acknowledge alert
  2. Assess impact
  3. Investigate root cause
  4. Apply fix
  5. Document resolution

Security Maintenance

Regular Tasks

graph TD
    Audit[Security Audit] --> Review[Review Findings]
    Review --> Update[Update Policies]
    Update --> Test[Test Changes]
    Test --> Document[Document Changes]

Security Checklist

  • Review network policies
  • Check certificate expiration
  • Audit access controls
  • Review secret rotation
  • Scan for vulnerabilities
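
One quick way to spot-check certificate expiry from a workstation (the hostname is an example from this cluster; any ingress host works):

echo | openssl s_client -connect grafana.chelonianlabs.com:443 -servername grafana.chelonianlabs.com 2>/dev/null | openssl x509 -noout -dates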

Troubleshooting Guide

Common Issues

  1. Node Problems

    • Check node status
    • Review system logs
    • Verify resource usage
    • Check connectivity
  2. Storage Issues

    • Check Ceph cluster health
    • Verify CephFS status
    • Monitor storage capacity
    • Review OSD performance
    • Check MDS responsiveness
    • Verify PVC mount status
  3. Network Problems

    • Check DNS resolution
    • Verify network policies
    • Review ingress status
    • Test connectivity

Recovery Procedures

  1. Node Recovery
# Check node status
kubectl get nodes

# Drain node for maintenance (skip DaemonSet-managed pods)
kubectl drain node-name --ignore-daemonsets

# Perform maintenance
# ...

# Uncordon node
kubectl uncordon node-name
  2. Storage Recovery
# Check Ceph cluster health
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Check PV status
kubectl get pv

# Check PVC status
kubectl get pvc -A

# Verify storage class
kubectl get sc

# Check CephFS status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status

# Check OSD status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

Documentation

Maintenance Logs

  • Keep detailed records
  • Document changes
  • Track issues
  • Update procedures

Review Process

  1. Regular documentation review
  2. Update procedures
  3. Verify accuracy
  4. Add new sections

Best Practices

  1. Change Management

    • Use git workflow
    • Test changes
    • Document updates
    • Monitor results
  2. Resource Management

    • Regular cleanup
    • Optimize usage
    • Monitor trends
    • Plan capacity
  3. Security

    • Regular audits
    • Update policies
    • Monitor access
    • Review logs

Troubleshooting Guide

Diagnostic Workflow

graph TD
    Issue[Issue Detected] --> Triage[Triage]
    Triage --> Diagnose[Diagnose]
    Diagnose --> Fix[Apply Fix]
    Fix --> Verify[Verify]
    Verify --> Document[Document]

Common Issues

1. Cluster Health Issues

Node Problems

graph TD
    Node[Node Issue] --> Check[Check Status]
    Check --> |Healthy| Resources[Resource Issue]
    Check --> |Unhealthy| System[System Issue]
    Resources --> Memory[Memory]
    Resources --> CPU[CPU]
    Resources --> Disk[Disk]
    System --> Logs[Check Logs]
    System --> Network[Network]

Diagnosis Steps:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check system resources
kubectl top nodes
kubectl top pods --all-namespaces

# Check system logs
kubectl logs -n kube-system <pod-name>

2. Storage Issues

Volume Problems

graph LR
    PV[PV Issue] --> Status[Check Status]
    Status --> |Bound| Access[Access Issue]
    Status --> |Pending| Provision[Provisioning Issue]
    Status --> |Failed| Storage[Storage System]

Resolution Steps:

# Check PV/PVC status
kubectl get pv,pvc --all-namespaces

# Check storage class
kubectl get sc

# Check provisioner pods
kubectl get pods -n storage

3. Network Issues

Connectivity Problems

graph TD
    Net[Network Issue] --> DNS[DNS Check]
    Net --> Ingress[Ingress Check]
    Net --> Policy[Network Policy]

    DNS --> CoreDNS[CoreDNS Pods]
    Ingress --> Traefik[Traefik Logs]
    Policy --> Rules[Policy Rules]

Diagnostic Commands:

# Check DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check ingress
kubectl get ingress --all-namespaces
kubectl describe ingress <ingress-name> -n <namespace>

4. Application Issues

Pod Problems

graph TD
    Pod[Pod Issue] --> Status[Check Status]
    Status --> |Pending| Schedule[Scheduling]
    Status --> |CrashLoop| Crash[Container Crash]
    Status --> |Error| Logs[Check Logs]

Troubleshooting Steps:

# Check pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Flux Issues

GitOps Troubleshooting

graph TD
    Flux[Flux Issue] --> Source[Source Controller]
    Flux --> Kust[Kustomize Controller]
    Flux --> Helm[Helm Controller]

    Source --> Git[Git Repository]
    Kust --> Sync[Sync Status]
    Helm --> Release[Release Status]

Resolution Steps:

# Check Flux components
flux check

# Check sources
flux get sources git
flux get sources helm

# Check reconciliation
flux get kustomizations
flux get helmreleases

Performance Issues

Resource Constraints

graph LR
    Perf[Performance] --> CPU[CPU Usage]
    Perf --> Memory[Memory Usage]
    Perf --> IO[I/O Usage]

    CPU --> Limit[Resource Limits]
    Memory --> Constraint[Memory Constraints]
    IO --> Bottleneck[I/O Bottleneck]

Analysis Commands:

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Check resource quotas
kubectl get resourcequota -n <namespace>

Recovery Procedures

1. Node Recovery

  1. Drain node
  2. Perform maintenance
  3. Uncordon node
  4. Verify workloads

2. Storage Recovery

  1. Backup data
  2. Fix storage issues
  3. Restore data
  4. Verify access

3. Network Recovery

  1. Check connectivity
  2. Verify DNS
  3. Test ingress
  4. Update policies

Best Practices

1. Logging

  • Maintain detailed logs
  • Set appropriate retention
  • Use structured logging
  • Enable audit logging

2. Monitoring

  • Set up alerts
  • Monitor resources
  • Track metrics
  • Use dashboards

3. Documentation

  • Document issues
  • Record solutions
  • Update procedures
  • Share knowledge

Emergency Procedures

Critical Issues

  1. Assess impact
  2. Implement temporary fix
  3. Plan permanent solution
  4. Update documentation

Contact Information

  • Maintain escalation paths
  • Keep contact list updated
  • Document response times
  • Track incidents

Network Operations Runbook

Overview

This runbook provides step-by-step procedures for common network operations, troubleshooting, and emergency recovery scenarios for the Dapper Cluster network.

Quick Reference:


Table of Contents

  1. Common Operations
  2. Troubleshooting Procedures
  3. Emergency Procedures
  4. Switch Configuration
  5. Performance Monitoring
  6. Maintenance Windows

Common Operations

Accessing Network Equipment

SSH to Switches

Brocade ICX6610:

ssh admin@192.168.1.20
# [TODO: Document default credentials location]

Arista 7050:

ssh admin@192.168.1.21
# [TODO: Document default credentials location]

Aruba S2500-48p:

ssh admin@192.168.1.26
# [TODO: Document default credentials location]

Console Access

When SSH is unavailable:

# [TODO: Document console server or direct serial access]
# Brocade: Serial settings [TODO: baud rate, etc]
# Arista: Serial settings [TODO: baud rate, etc]

Checking Switch Health

Brocade ICX6610

# Basic health check
show version
show chassis
show cpu
show memory
show log tail 50

# Temperature and power
show inline power
show environment

# Check for errors
show logging | include error
show logging | include warn

Arista 7050

# Basic health check
show version
show environment all
show processes top

# Check for errors
show logging last 100
show logging | grep -i error

Verifying VLAN Configuration

Check VLAN Assignments

Brocade:

show vlan

# Check specific VLAN
show vlan 100
show vlan 150
show vlan 200

# Check which ports are in which VLANs
show vlan ethernet 1/1/1

Arista:

show vlan

# Check VLAN details
show vlan id 200

# Show interfaces by VLAN
show interfaces status

Verify Trunk Ports

Brocade:

# Show trunk configuration
show interface brief | include Trunk

# Show specific trunk
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Arista:

# Show trunk ports
show interface trunk

# Show specific interface
show interface ethernet 49
show interface ethernet 50

Brocade LAG Status

# Show all LAG groups
show lag brief

# Show specific LAG details
show lag [lag-id]

# Show which ports are in LAG
show lag | include active

# Check individual LAG port status
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Expected Output When Working:

LAG "brocade-to-arista" (lag-id [X]) has 2 active ports:
  ethernet 1/1/41 (40Gb) - Active
  ethernet 1/1/42 (40Gb) - Active

Arista Port-Channel Status

# Show port-channel summary
show port-channel summary

# Show specific port-channel
show interface port-channel 1

# Check member interfaces
show interface ethernet 49 port-channel
show interface ethernet 50 port-channel

Expected Output When Working:

Port-Channel1:
  Active Ports: 2
  Et49: Active
  Et50: Active
  Protocol: LACP

Monitoring Traffic and Bandwidth

Real-Time Interface Statistics

Brocade:

# Show interface rates
show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate

# Show all interface statistics
show interface ethernet 1/1/41

# Monitor in real-time (if supported)
monitor interface ethernet 1/1/41

Arista:

# Show interface counters
show interface ethernet 49 counters

# Show interface rates
show interface ethernet 49 | grep rate

# Real-time monitoring
watch 1 show interface ethernet 49 counters rate

Identify Top Talkers

Brocade:

# [TODO: Document method to identify top talkers]
# May require SNMP monitoring or sFlow

Arista:

# Check interface utilization
show interface counters utilization

# If sFlow configured:
# [TODO: Document sFlow commands]

Testing Connectivity

From Your Workstation

Test Management Plane:

# Ping all management interfaces
ping -c 4 192.168.1.20  # Brocade
ping -c 4 192.168.1.21  # Arista
ping -c 4 192.168.1.26  # Aruba
ping -c 4 192.168.1.7   # Mikrotik House
ping -c 4 192.168.1.8   # Mikrotik Shop

# Test wireless bridge latency
ping -c 100 192.168.1.8 | tail -3

Test Server Network (VLAN 100):

# Test Kubernetes nodes
ping -c 4 10.100.0.40   # K8s VIP
ping -c 4 10.100.0.50   # talos-control-1
ping -c 4 10.100.0.51   # talos-control-2
ping -c 4 10.100.0.52   # talos-control-3

Test from Kubernetes Nodes:

# SSH to a Talos node (if enabled) or use kubectl exec
kubectl exec -it -n default <pod-name> -- sh

# Test connectivity
ping 10.150.0.10   # Storage network
ping 10.100.0.1    # Gateway
ping 8.8.8.8       # Internet

MTU Testing (Jumbo Frames)

Test VLAN 150/200 MTU 9000:

# From a host on VLAN 150
ping -M do -s 8972 10.150.0.10

# -M do: Don't fragment
# -s 8972: 8972 + 28 (IP+ICMP headers) = 9000

# If this fails but smaller packets work, MTU is misconfigured

Path Testing

Trace route across networks:

# From your workstation
traceroute 10.100.0.50

# Expected path (if everything is working):
# 1. Local gateway
# 2. Wireless bridge
# 3. Brocade/OPNsense
# 4. Destination

Troubleshooting Procedures

Issue: No Connectivity to Garage Switches

Symptoms:

  • Cannot ping/SSH to Brocade (192.168.1.20) or Arista (192.168.1.21)
  • Can ping Aruba switch (192.168.1.26)

Diagnosis:

  1. Test wireless bridge:

    ping 192.168.1.7   # Mikrotik House
    ping 192.168.1.8   # Mikrotik Shop
    
    • If 192.168.1.7 responds but 192.168.1.8 doesn't: the wireless link is down
    • If neither responds: Mikrotik issue or configuration problem
  2. Check Aruba-to-Mikrotik connection:

    # SSH to Aruba
    ssh admin@192.168.1.26
    
    # Check port status for Mikrotik connection
    show interface [TODO: port ID]
    

Resolution:

If wireless bridge is down:

  1. Check Mikrotik radios web interface (192.168.1.7, 192.168.1.8)
  2. Check alignment and signal strength
  3. Verify power to both radios
  4. Check for interference (weather, obstacles)
  5. Emergency: Use physical console access to switches in garage

If Mikrotik is up but switches unreachable:

  1. Check VLAN 1 configuration on trunk ports
  2. Verify Mikrotik is not blocking traffic
  3. Check Brocade port connected to Mikrotik is up

Issue: Kubernetes Pods Can't Access Storage

Symptoms:

  • Pods stuck in ContainerCreating
  • PVC stuck in Pending
  • Errors about unable to mount CephFS

Diagnosis:

  1. Check Rook/Ceph health:

    kubectl -n rook-ceph get cephcluster
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
    
  2. Check network connectivity from Kubernetes nodes to Ceph monitors:

    # From a Talos node or debug pod
    ping 10.150.0.10   # Test VLAN 150 connectivity
    
    # Test Ceph monitor port
    nc -zv <monitor-ip> 6789
    
  3. Verify VLAN 150 MTU:

    # Test jumbo frames
    ping -M do -s 8972 10.150.0.10
    
  4. Check CSI driver logs:

    kubectl -n rook-ceph logs -l app=csi-cephfsplugin --tail=100
    

Resolution:

If MTU mismatch:

  1. Verify MTU 9000 on all VLAN 150 interfaces
  2. Check Proxmox bridge MTU settings
  3. Check switch port MTU configuration

If connectivity issue:

  1. Check VLAN 150 is properly tagged on trunk ports
  2. Verify Proxmox host network configuration
  3. Check Brocade routing for VLAN 150

Issue: Slow Ceph Performance

Symptoms:

  • Slow pod startup times
  • High I/O latency in applications
  • Ceph health warnings about slow ops

Diagnosis:

  1. Check Ceph cluster health:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
    
  2. Check network bandwidth utilization:

    On Brocade (VLAN 150 - Ceph Public):

    # Check 10Gb bonds to Proxmox hosts
    show interface ethernet 1/1/[TODO: ports] | include rate
    

    On Arista (VLAN 200 - Ceph Cluster):

    # Check 40Gb links to Proxmox hosts
    show interface ethernet [TODO: ports] counters rate
    
  3. Identify bottlenecks:

    • Are 10Gb links saturated? (VLAN 150)
    • Are 40Gb links saturated? (VLAN 200)
    • Is the Brocade-Arista link saturated?

Resolution:

If Brocade-Arista link is bottleneck:

  • Primary Issue: Only one 40Gb link active (see below to enable second link)
  • Enabling second 40Gb link will double bandwidth to 80Gbps

If MTU not configured:

  • Verify MTU 9000 on VLAN 150 and 200
  • Check each hop in the path

If switch CPU is high:

  • Check for broadcast storms
  • Verify STP is working correctly
  • Look for loops in topology

Issue: Network Loop / Broadcast Storm

Symptoms:

  • Network performance severely degraded
  • High CPU usage on switches
  • Connectivity flapping
  • Massive packet rates on interfaces

Diagnosis:

  1. Check for duplicate MAC addresses:

    # Brocade
    show mac-address
    
    # Look for same MAC on multiple ports
    
  2. Check STP status:

    # Brocade
    show spanning-tree
    
    # Arista
    show spanning-tree
    
  3. Look for physical loops:

    • Review physical topology diagram
    • Check for accidental double connections
    • Known issue: Brocade-Arista 2x 40Gb links not in LAG

Resolution:

Immediate (Emergency):

  1. Disable one link causing loop:

    # On Arista (already done in current config)
    configure
    interface ethernet 50
    shutdown
    
  2. Verify spanning-tree is enabled:

    # Brocade
    show spanning-tree
    
    # If not enabled:
    configure terminal
    spanning-tree
    

Permanent Fix:

  • Configure proper LAG/port-channel (see section below)

Issue: Proxmox Host Loses Network Connectivity

Symptoms:

  • Cannot ping Proxmox host management IP
  • VMs on host also offline
  • IPMI still accessible

Diagnosis:

  1. Access via IPMI console:

    # [TODO: Document IPMI access method]
    
  2. Check bond status on Proxmox:

    # From Proxmox console
    ip link show
    
    # Check bond interfaces
    cat /proc/net/bonding/bond0
    cat /proc/net/bonding/bond1
    
  3. Check switch ports:

    # On Brocade
    show interface ethernet 1/1/[TODO: ports for this host]
    show lag [TODO: lag-id for this host]
    

Resolution:

If bond is down on Proxmox:

  1. Check physical cables
  2. Restart networking on Proxmox (WARNING: will disrupt VMs)
  3. Check switch port status

If ports down on switch:

  1. Check for error counters
  2. Re-enable port if administratively down
  3. Check for physical issues (SFP, cable)

Issue: High Latency Across Wireless Bridge

Symptoms:

  • Ping times to garage > 10ms (normally 1-2ms)
  • Slow access to services in garage
  • Packet loss

Diagnosis:

  1. Test latency:

    ping -c 100 192.168.1.8
    
    # Look at:
    # - Average latency
    # - Packet loss %
    # - Jitter (variation)
    
  2. Check Mikrotik radio status:

    • Access web interface: 192.168.1.7 and 192.168.1.8
    • Check signal strength
    • Check throughput/bandwidth utilization
    • Look for interference
  3. Test with iperf:

    # On server side (garage)
    iperf3 -s
    
    # On client side (house)
    iperf3 -c 192.168.1.8 -t 30
    
    # Should see ~1 Gbps
    

Resolution:

If signal degraded:

  1. Check for obstructions (trees, weather)
  2. Check alignment
  3. Check for interference sources
  4. Consider backup link or failover

If bandwidth saturated:

  1. Identify high-bandwidth users/applications
  2. Implement QoS if available
  3. Consider upgrade to higher bandwidth link

Emergency Procedures

Complete Network Outage (Wireless Bridge Down)

Impact:

  • No remote access to garage infrastructure
  • Kubernetes cluster still functions internally
  • No internet access from garage
  • Management access requires physical presence

Emergency Access Methods:

  1. Physical console access:

    # [TODO: Document where console cables are stored]
    # Connect laptop directly to switch console port
    
  2. IPMI access (if VPN or alternative route exists):

    # [TODO: Document IPMI network topology]
    

Restoration Steps:

  1. Check Mikrotik radios:

    • Physical inspection of both radios
    • Power cycle if needed
    • Check alignment
  2. Temporary workaround:

    • [TODO: Document backup connectivity method]
    • VPN tunnel over alternative route?
    • Temporary cable run?
  3. Verify restoration:

    ping 192.168.1.8
    ping 192.168.1.20
    ssh admin@192.168.1.20
    

Core Switch (Brocade) Failure

Impact:

  • Loss of VLAN 150/200 routing
  • Kubernetes cluster degraded (storage issues)
  • Loss of 10Gb connectivity to Proxmox hosts

Emergency Actions:

  1. Do NOT reboot all Proxmox hosts simultaneously

    • The cluster may still be serving workloads that are already running
    • Storage connections via VLAN 200 through Arista may still work
  2. Check Brocade status:

    • Physical inspection (power, fans, LEDs)
    • Console access
    • Review logs
  3. If Brocade must be replaced:

    • [TODO: Document backup configuration location]
    • [TODO: Document restoration procedure]
    • [TODO: Document spare hardware location]

Spanning Tree Failure / Network Loop

Impact:

  • Network completely unusable
  • High CPU on all switches
  • Broadcast storm

Emergency Actions:

  1. Disconnect Brocade-Arista links:

    # On Arista (fastest access if SSH still works)
    configure
    interface ethernet 49
    shutdown
    interface ethernet 50
    shutdown
    
  2. Or physically disconnect:

    • Unplug both 40Gb QSFP+ cables between Brocade and Arista
  3. Wait for network to stabilize (30-60 seconds)

  4. Reconnect ONE link only:

    # On Arista
    configure
    interface ethernet 49
    no shutdown
    
  5. Verify stability before enabling second link

Accidental Configuration Change

Symptoms:

  • Network suddenly degraded after change
  • New errors appearing
  • Connectivity loss

Emergency Actions:

  1. Rollback configuration:

    Brocade:

    # Show configuration history
    show configuration
    
    # Revert to previous config
    # [TODO: Document Brocade config rollback method]
    

    Arista:

    # Show rollback options
    show configuration sessions
    
    # Rollback to previous
    configure session rollback <session-name>
    
  2. If rollback not available:

    • Reboot switch (loads startup-config)
    • WARNING: Brief outage during reboot

Switch Configuration

Configure Brocade-Arista LAG (Fix Loop Issue)

Prerequisites:

  • Maintenance window scheduled
  • Both 40Gb QSFP+ cables connected and working
  • Console access to both switches available
  • Configuration backed up

Step 1: Pre-Change Verification

# Verify current state
# On Brocade:
show interface ethernet 1/1/41
show interface ethernet 1/1/42

# On Arista:
show interface ethernet 49
show interface ethernet 50  # Currently disabled

# Document current traffic levels
show interface ethernet 1/1/41 | include rate

Step 2: Configure Brocade LAG

# SSH to Brocade
ssh admin@192.168.1.20

# Enter configuration mode
enable
configure terminal

# Create LAG
lag brocade-to-arista dynamic id [TODO: Choose available LAG ID, e.g., 10]
  ports ethernet 1/1/41 to 1/1/42
  primary-port 1/1/41
  lacp-timeout short
  deploy
exit

# Configure VLAN on LAG
vlan 1
  tagged lag [LAG-ID]
exit

vlan 100
  tagged lag [LAG-ID]
exit

vlan 150
  tagged lag [LAG-ID]
exit

vlan 200
  tagged lag [LAG-ID]
exit

# Apply to interfaces
interface ethernet 1/1/41
  link-aggregate active
exit

interface ethernet 1/1/42
  link-aggregate active
exit

# Save configuration
write memory

# Verify
show lag brief
show lag [LAG-ID]

Step 3: Configure Arista Port-Channel

# SSH to Arista
ssh admin@192.168.1.21

# Enter configuration mode
enable
configure

# Create port-channel
interface Port-Channel1
  description Link to Brocade ICX6610
  switchport mode trunk
  switchport trunk allowed vlan 1,100,150,200
exit

# Add member interfaces
interface Ethernet49
  description Brocade 40G Link 1
  channel-group 1 mode active
  lacp rate fast
exit

interface Ethernet50
  description Brocade 40G Link 2
  channel-group 1 mode active
  lacp rate fast
exit

# Save configuration
write memory

# Verify
show port-channel summary
show interface Port-Channel1
show lacp neighbor

Step 4: Verify Configuration

# On Brocade:
show lag [LAG-ID]
# Should show: 2 ports active

show lacp
# Should show: Negotiated with neighbor

# On Arista:
show port-channel summary
# Should show: Po1(U) with Et49(P), Et50(P)

show lacp neighbor
# Should show: Brocade as partner

# Test traffic balancing
show interface Port-Channel1 counters
show interface ethernet 49 counters
show interface ethernet 50 counters
# Both Et49 and Et50 should show traffic

Step 5: Monitor for Issues

# Watch for 15 minutes
# On Arista:
watch 10 show port-channel summary

# Check for errors
show logging | grep -i Port-Channel1

# Monitor CPU (should be normal)
show processes top

Rollback Plan (if issues occur):

# On Arista (fastest to disable)
configure
interface ethernet 50
shutdown

# On Brocade (if needed)
configure terminal
no lag brocade-to-arista
interface ethernet 1/1/41
  no link-aggregate
interface ethernet 1/1/42
  no link-aggregate

Adding a New VLAN

Example: Adding VLAN 300 for IoT devices

Step 1: Plan VLAN

  • VLAN ID: 300
  • Network: [TODO: e.g., 10.30.0.0/24]
  • Gateway: [TODO: Which device?]
  • Required on: [TODO: Which switches/trunks?]

Step 2: Create VLAN on Brocade

ssh admin@192.168.1.20
enable
configure terminal

# Create VLAN
vlan 300
  name IoT-Network
  tagged ethernet 1/1/41 to 1/1/42  # Trunk to Arista
  tagged ethernet 1/1/[TODO]         # Trunk to Mikrotik
  untagged ethernet 1/1/[TODO]       # Access ports if needed
exit

# If Brocade is gateway:
interface ve 300
  ip address [TODO: IP]/24
exit

# Save
write memory

Step 3: Add to other switches as needed

# On Arista:
configure
vlan 300
  name IoT-Network
exit

interface Port-Channel1
  switchport trunk allowed vlan add 300
exit

write memory

Configuring Jumbo Frames (MTU 9000)

For VLAN 150 and 200 (Ceph networks)

On Brocade:

# Configure MTU on VLAN interfaces
interface ve 150
  mtu 9000
exit

interface ve 200
  mtu 9000
exit

# Configure MTU on physical/LAG interfaces
interface ethernet 1/1/[TODO: storage network ports]
  mtu 9000
exit

write memory

On Arista:

# Configure MTU on interfaces carrying VLAN 150/200
interface ethernet [TODO: ports]
  mtu 9216  # 9000 + overhead
exit

write memory

Verify MTU:

# From Talos node
ping -M do -s 8972 10.150.0.10
# Should succeed without fragmentation

Performance Monitoring

Key Metrics to Monitor

Switch Health:

  • CPU utilization (should be <30% normally)
  • Memory utilization (should be <70%)
  • Temperature (within operating range)
  • Power supply status

Interface Health:

  • Error counters (input/output errors)
  • CRC errors
  • Interface resets
  • Utilization percentage

Traffic Patterns:

  • Bandwidth utilization per interface
  • Top talkers per VLAN
  • Broadcast/multicast rates

Setting Up Monitoring

[TODO: Document monitoring setup]

Options:

  1. SNMP monitoring to Prometheus
  2. sFlow for traffic analysis
  3. Switch logging to Loki
  4. Grafana dashboards

Example Prometheus Targets:

# [TODO: Example prometheus config for SNMP exporter]

Baseline Performance Metrics

Normal Operating Conditions:

| Metric | Expected Value | Alert Threshold |
| --- | --- | --- |
| Wireless Bridge Latency | 1-2ms | > 5ms |
| Wireless Bridge Loss | 0% | > 1% |
| Brocade CPU | < 20% | > 60% |
| Arista CPU | < 15% | > 50% |
| 40Gb Link Utilization | < 50% | > 80% |
| 10Gb Link Utilization | < 60% | > 85% |

[TODO: Add baseline measurements]


Maintenance Windows

Pre-Maintenance Checklist

Before any network maintenance:

  • Schedule maintenance window
  • Notify all users
  • Back up switch configurations
    # Brocade
    show running-config > backup-$(date +%Y%m%d).cfg
    
    # Arista
    show running-config > backup-$(date +%Y%m%d).cfg
    
  • Document current state
  • Have rollback plan ready
  • Ensure console access available
  • Test backup connectivity method

Post-Maintenance Checklist

After any network maintenance:

  • Verify all links are up
    show interface brief
    show lag brief  # Brocade
    show port-channel summary  # Arista
    
  • Check for errors
    show logging | include error
    
  • Test connectivity to all VLANs
  • Monitor for 30 minutes for issues
  • Update documentation with any changes
  • Save configurations
    write memory
    

Regular Maintenance Tasks

Weekly:

  • Review switch logs for errors/warnings
  • Check interface error counters
  • Verify wireless bridge performance

Monthly:

  • Review bandwidth utilization trends
  • Check for firmware updates
  • Verify backup configurations are current

Quarterly:

  • Review and update network documentation
  • Test emergency procedures
  • Review and optimize switch configurations

Configuration Backup

Backing Up Switch Configurations

Brocade ICX6610:

# Method 1: Copy to TFTP server
copy running-config tftp [TODO: TFTP server IP] brocade-backup-$(date +%Y%m%d).cfg

# Method 2: Display and save manually
show running-config > /tmp/brocade-config.txt

# [TODO: Document automated backup method]

Arista 7050:

# Show running config
show running-config

# Copy to USB (if available)
copy running-config usb:/arista-backup-$(date +%Y%m%d).cfg

# [TODO: Document automated backup method]

Storage Location:

  • [TODO: Document where configurations are backed up]
  • Consider: Git repository for version control
  • Consider: Automated daily backups via Ansible

Restoring Configurations

Brocade:

# Load config from file
copy tftp running-config [TFTP-IP] [filename]

# Or manually paste config
configure terminal
# Paste configuration

Arista:

# Copy config from file
copy usb:/backup.cfg running-config

# Or configure manually
configure
# Paste configuration

Security Considerations

Access Control

[TODO: Document security policies]

  • Who has access to switch management?
  • How are credentials managed?
  • Is 2FA available/configured?
  • Are management VLANs isolated?

Security Best Practices

  1. Change default passwords
  2. Disable unused ports
  3. Enable port security where appropriate
  4. Configure DHCP snooping
  5. Enable storm control
  6. Regular firmware updates
  7. Monitor for unauthorized devices

Useful Commands Reference

Brocade ICX6610 Quick Reference

# Basic show commands
show version
show running-config
show interface brief
show vlan
show lag
show mac-address
show spanning-tree
show log

# Interface management
interface ethernet 1/1/1
  enable
  disable
  description [text]

# Save configuration
write memory

Arista 7050 Quick Reference

# Basic show commands
show version
show running-config
show interfaces status
show vlan
show port-channel summary
show mac address-table
show spanning-tree
show logging

# Interface management
configure
interface ethernet 1
  shutdown
  no shutdown
  description [text]

# Save configuration
write memory

Contacts and Escalation

[TODO: Fill in contact information]

| Role | Name | Contact | Escalation Level |
| --- | --- | --- | --- |
| Primary Network Admin | [TODO] | [TODO] | 1 |
| Secondary Contact | [TODO] | [TODO] | 2 |
| Vendor Support - Brocade | [TODO] | [TODO] | 3 |
| Vendor Support - Arista | [TODO] | [TODO] | 3 |

Change Log

| Date | Change | Person | Impact | Notes |
| --- | --- | --- | --- | --- |
| 2025-10-14 | Initial runbook created | Claude | None | Baseline documentation |
| [TODO] | [TODO] | [TODO] | [TODO] | [TODO] |

References

40GB Ceph Storage Network Configuration

Project Overview

This project configures the 40GB network infrastructure for Ceph cluster storage (VLAN 200), providing dedicated high-bandwidth links for Ceph OSD replication traffic. The configuration enables 40Gbps connectivity for three Proxmox hosts with routing support for the fourth host through a Brocade-Arista link aggregation.

Key Benefits:

  • 🚀 4x bandwidth increase for hosts with 40Gb links (10Gb → 40Gb)
  • 🔗 80Gbps aggregated bandwidth between Brocade and Arista switches
  • 📦 Jumbo frame support (MTU 9000) for improved Ceph performance
  • 🔀 No bottlenecks for mixed-speed traffic (10Gb and 40Gb hosts)
  • 🛡️ Resilient design with LACP link aggregation and failover

Project Status: ✅ Ready for Implementation


Architecture Summary

Current State (Before Configuration)

VLAN 200 Traffic:

  • All Proxmox hosts using 2x 10Gb bonds to Brocade
  • Shared bandwidth with VLAN 100, 150
  • MTU 1500 (no jumbo frames)
  • Brocade-Arista: Only 1x 40Gb link active (2nd disabled due to loop)

Limitations:

  • Ceph replication limited to ~9-10 Gbps per host
  • Contention with other VLAN traffic
  • No jumbo frame support

Target State (After Configuration)

VLAN 200 Traffic:

  • Proxmox-01, 02, 04: Dedicated 40Gb links to Arista
  • Proxmox-03: 10Gb bond to Brocade (no 40Gb link available)
  • Brocade-Arista: 2x 40Gb LAG (80Gbps total)
  • MTU 9000 throughout the path
  • Layer 3 routing on Brocade for Proxmox-03

Benefits:

  • Proxmox-01, 02, 04: 40 Gbps dedicated Ceph bandwidth
  • Proxmox-03: 10 Gbps Ceph bandwidth (no bottleneck at switches)
  • Direct switching for 40Gb hosts (no routing latency)
  • Optimized for large Ceph object transfers (jumbo frames)

Network Topology

Physical Connections

┌─────────────┐
│ Proxmox-01  │───────40Gb───────┐
│ (10.200.0.1)│                  │
└─────────────┘                  │
                                 │
┌─────────────┐                  │
│ Proxmox-02  │───────40Gb───────┤
│ (10.200.0.2)│                  │
└─────────────┘                  │    ┌──────────────┐
                                 ├────│ Arista 7050  │
┌─────────────┐                  │    │              │
│ Proxmox-04  │───────40Gb───────┤    │ Et27: Px-01  │
│ (10.200.0.4)│                  │    │ Et28: Px-02  │
└─────────────┘                  │    │ Et29: Px-04  │
                                 │    └──────┬───────┘
                                 │           │
┌─────────────┐                  │      2x 40Gb LAG
│ Proxmox-03  │─┐                │    (Port-Channel1)
│ (10.200.0.3)│ │                │           │
└─────────────┘ │                │    ┌──────┴───────┐
                │                │    │ Brocade 6610 │
          2x 10Gb bond           │    │              │
          (LAG to Brocade)       │    │ LAG 11: 80Gb │
                │                │    │ VE200: GW    │
                └────────────────┘    └──────────────┘
                                      (10.200.0.254)

Traffic Paths

Direct 40Gb Paths (no routing):

  • Proxmox-01 ↔ Proxmox-02: Via Arista (switched)
  • Proxmox-01 ↔ Proxmox-04: Via Arista (switched)
  • Proxmox-02 ↔ Proxmox-04: Via Arista (switched)

Routed Paths (through Brocade):

  • Proxmox-03 ↔ Proxmox-01/02/04:
    • Px-03 → 10Gb bond → Brocade VE200 (Layer 3 routing)
    • Brocade → 80Gb LAG → Arista
    • Arista → 40Gb → Px-01/02/04
    • Bandwidth: Limited to 10Gb at Px-03, but LAG prevents bottleneck

Configuration Phases

This project is divided into 6 phases that must be completed in order:

Phase 1: Configure Brocade-Arista 40GB LAG ⚙️

Documentation

Enable the second 40Gb link between Brocade and Arista by configuring LACP link aggregation.

Key Tasks:

  • Create LAG on Brocade (LAG ID 11)
  • Create Port-Channel on Arista (Port-Channel1)
  • Enable both 40Gb links (Et25, Et26 on Arista)
  • Configure LACP with fast timeout

Outcome: 80Gbps bandwidth between switches, eliminating potential bottleneck.


Phase 2: Configure VLAN 200 Routing on Brocade 🌐

Documentation

Configure Layer 3 routing on Brocade for Proxmox-03 to reach other hosts via VLAN 200.

Key Tasks:

  • Create VLAN interface VE200 on Brocade (10.200.0.254/24)
  • Set MTU 9000 on VE200
  • Configure default route on Proxmox-03 pointing to Brocade

Outcome: Proxmox-03 can route to all other hosts on VLAN 200 through Brocade.


Phase 3: Configure Arista VLAN 200 with Jumbo Frames 📦

Documentation

Enable jumbo frame support (MTU 9216) on all Arista interfaces carrying VLAN 200 traffic.

Key Tasks:

  • Set MTU 9216 on Port-Channel1 (Brocade link)
  • Set MTU 9216 on Et27, Et28, Et29 (Proxmox links)
  • Verify VLAN 200 trunk configuration

Outcome: Arista switch supports jumbo frames for optimal Ceph performance.


Phase 4: Identify 40GB NICs on Proxmox Hosts 🔍

Documentation

Identify which physical network interfaces are the 40Gb NICs on each Proxmox host and map them to Arista switch ports.

Key Tasks:

  • Use ethtool/LLDP/link flap to identify 40Gb interfaces
  • Document interface names (e.g., enp2s0, enp7s0)
  • Map Proxmox hosts to Arista ports (Et27, Et28, Et29)
  • Record MAC addresses for tracking

Outcome: Complete mapping of Proxmox hosts to Arista ports with interface names documented.


Phase 5: Reconfigure Proxmox Hosts 🔧

Documentation

Reconfigure Proxmox hosts to move VLAN 200 from 10Gb bonds to dedicated 40Gb interfaces.

Key Tasks:

  • Edit /etc/network/interfaces on each host
  • Move vmbr200 from bond1 to 40Gb interface (with VLAN 200 tag)
  • Set MTU 9000 on all interfaces
  • Test connectivity after each host (one at a time!)

Outcome: Proxmox-01, 02, 04 using 40Gb links; Proxmox-03 using 10Gb bond with routing.

⚠️ Important: Configure hosts one at a time to minimize Ceph disruption!
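
For orientation, a minimal /etc/network/interfaces sketch for one 40Gb host, using Proxmox-02's values from the reference table later in this document (interface names must be confirmed against Phase 4 findings, and whether the address lives on the bridge or the VLAN sub-interface depends on the existing host layout):

# Illustrative only - adapt to the interfaces identified in Phase 4
auto enp7s0
iface enp7s0 inet manual
    mtu 9000

auto enp7s0.200
iface enp7s0.200 inet manual
    mtu 9000

auto vmbr200
iface vmbr200 inet static
    address 10.200.0.2/24
    bridge-ports enp7s0.200
    bridge-stp off
    bridge-fd 0
    mtu 9000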


Phase 6: Testing and Verification ✅

Documentation

Comprehensive testing to validate the configuration and measure performance improvements.

Test Categories:

  1. Connectivity Tests: Ping between all hosts, ARP resolution
  2. MTU Tests: Jumbo frame validation (ping -M do -s 8972)
  3. Bandwidth Tests: iperf3 throughput measurements
  4. Ceph Performance: OSD network tests, rebalance speed
  5. Switch Verification: LAG status, traffic distribution, error counters
  6. Failover Tests: LAG member failure, host failure scenarios

Outcome: Validated 4x performance improvement with comprehensive test results.


Quick Start Guide

Prerequisites

Before starting, ensure you have:

  • SSH access to all switches (Brocade, Arista)
  • SSH/IPMI access to all Proxmox hosts
  • Backup of all current configurations
  • Maintenance window scheduled (recommended: 2-4 hours)
  • Console access ready (IPMI) in case of network issues
  • This documentation downloaded and available offline

Execution Order

Follow phases in strict order:

  1. Phase 1 (30 min): Configure switch LAG - minimal disruption
  2. Phase 2 (15 min): Add Brocade routing - no disruption to existing hosts
  3. Phase 3 (15 min): Set MTU on Arista - minimal disruption
  4. Phase 4 (30 min): Identify interfaces - read-only, no disruption
  5. Phase 5 (60 min): Reconfigure hosts - brief Ceph disruption per host
  6. Phase 6 (60+ min): Testing - monitor for issues

Total time: 3-4 hours (including testing)

Rollback Strategy

Each phase includes a rollback procedure. Key rollback points:

  • Phase 1: Disable 2nd 40Gb link on Arista (immediate)
  • Phase 2: Disable VE200 on Brocade (immediate)
  • Phase 3: Revert MTU on Arista (immediate)
  • Phase 5: Restore /etc/network/interfaces backup on each host

Critical: Keep one SSH session open to each device before making changes!
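
For Phase 5, the host-side rollback is a file restore (a sketch assuming Proxmox's ifupdown2; adjust the backup filename to the actual date):

# On the affected Proxmox host
cp /etc/network/interfaces.backup-YYYYMMDD /etc/network/interfaces
ifreload -a    # or reboot the host if ifreload is unavailable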


Configuration Reference

IP Addressing (VLAN 200)

| Host | Management IP | VLAN 200 IP | Interface | Link Speed | Connected To |
| --- | --- | --- | --- | --- | --- |
| Proxmox-01 | 192.168.1.62 | 10.200.0.1/24 | enp2s0.200 | 40Gb | Arista Et27 |
| Proxmox-02 | 192.168.1.63 | 10.200.0.2/24 | enp7s0.200 | 40Gb | Arista Et28 |
| Proxmox-03 | 192.168.1.64 | 10.200.0.3/24 | bond1.200 | 2x 10Gb | Brocade LAG 2 |
| Proxmox-04 | 192.168.1.66 | 10.200.0.4/24 | enp7s0.200 | 40Gb | Arista Et29 |
| Brocade | 192.168.1.20 | 10.200.0.254/24 | VE200 | - | Gateway |

Note: Interface names may vary - update based on Phase 4 findings.

Switch Configuration Summary

Brocade ICX6610:

  • LAG 11: "brocade-to-arista" (ports 1/2/1, 1/2/6)
    • LACP dynamic, short timeout
    • VLANs 1, 20, 100, 150, 200 tagged
    • MTU 9000
  • VLAN 200:
    • VE200: 10.200.0.254/24, MTU 9000
    • Routing enabled

Arista 7050:

  • Port-Channel1: (Et25, Et26)
    • LACP active, fast rate
    • Trunk: VLANs 1, 20, 100, 150, 200
    • MTU 9216
  • Et27, Et28, Et29:
    • Trunk mode
    • VLAN 200 tagged
    • MTU 9216

MTU Configuration

| Device | Interface | MTU | Notes |
| --- | --- | --- | --- |
| Brocade | VE200 | 9000 | VLAN 200 gateway |
| Brocade | LAG 11 | 9000 | To Arista |
| Arista | Port-Channel1 | 9216 | From Brocade |
| Arista | Et27-29 | 9216 | To Proxmox |
| Proxmox | 40Gb NIC | 9000 | Physical interface |
| Proxmox | VLAN interface | 9000 | enp*.200 |
| Proxmox | vmbr200 | 9000 | Bridge |

Why 9216 on Arista? The Arista interface MTU must cover the 9000-byte payload plus Layer 2 framing overhead (Ethernet and VLAN headers), so it is set to 9216 to give jumbo frames from the hosts headroom.


Expected Performance

Bandwidth Improvements

| Connection | Before (10Gb) | After (40Gb) | Improvement |
| --- | --- | --- | --- |
| Px-01 ↔ Px-02 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-01 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-02 ↔ Px-04 | ~9 Gbps | ~35-38 Gbps | 4.0x |
| Px-03 ↔ Others | ~9 Gbps | ~9-10 Gbps | 1.0x* |

* Proxmox-03 limited by 10Gb uplink, but no switch bottleneck due to 80Gb LAG.

Latency Improvements

| Path | Expected RTT | Notes |
| --- | --- | --- |
| 40Gb direct | < 0.5 ms | Switched only (no routing) |
| Via Brocade | < 1.5 ms | One Layer 3 hop |
| Before (10Gb shared) | 1-2 ms | Shared bandwidth, potential queuing |

Ceph Performance

Expected improvements:

  • Rebalance speed: 4x faster on hosts with 40Gb links
  • OSD recovery: Significantly reduced time for large objects
  • Client I/O: Reduced latency for Kubernetes pods using RBD/CephFS
  • Concurrent operations: Better performance under load

Troubleshooting

Common Issues

Issue: LAG/Port-Channel won't form

  • Check: LACP mode (both sides "active"), VLAN configuration matches
  • Verify: Physical links are up, no cable/SFP issues
  • Ref: Phase 1 Troubleshooting

Issue: Jumbo frames don't work

  • Check: MTU on every hop (host → switch → switch → host)
  • Test: ping -M do -s 8972 <destination>
  • Ref: Phase 3 Troubleshooting

Issue: Proxmox-03 can't reach other hosts

  • Check: VE200 is up, routing table on Px-03, gateway configured
  • Verify: LAG 11 is up between Brocade and Arista
  • Ref: Phase 2 Troubleshooting

Issue: Lower than expected bandwidth

  • Check: CPU usage during iperf3, network card offloading settings
  • Test: Multi-stream iperf3 (-P 10)
  • Ref: Phase 6 Troubleshooting

Emergency Contacts

| Role | Responsibility | Contact |
| --- | --- | --- |
| Network Admin | Switch configuration | [FILL IN] |
| System Admin | Proxmox hosts | [FILL IN] |
| Storage Admin | Ceph cluster | [FILL IN] |

Maintenance and Operations

Regular Checks

Weekly:

  • Monitor switch logs for errors
  • Check LAG/Port-Channel status
  • Review interface error counters
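
Example commands for the weekly checks (exact syntax varies slightly by firmware version):

# Brocade ICX6610
show lag
show interfaces brief

# Arista 7050
show port-channel summary
show interfaces counters errors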

Monthly:

  • Verify bandwidth utilization trends
  • Test failover procedures
  • Review and update documentation

Quarterly:

  • Full connectivity and performance test
  • Review configurations for optimization
  • Plan for firmware updates

Backup Procedures

Switch Configurations:

# Capture switch configs over SSH from a workstation (the switch CLIs cannot
# redirect to a dated local file themselves; adjust usernames/hostnames)
ssh admin@brocade "show running-config" > brocade-backup-$(date +%Y%m%d).txt
ssh admin@arista "show running-config" > arista-backup-$(date +%Y%m%d).txt

Proxmox Network Configs:

# On each host
cp /etc/network/interfaces /etc/network/interfaces.backup-$(date +%Y%m%d)

Store backups in: /home/derek/projects/dapper-cluster/docs/src/operations/network-configs/backups/

Future Enhancements

Potential Improvements:

  • Add 40Gb link to Proxmox-03 (hardware upgrade)
  • Implement network monitoring with Prometheus SNMP exporter
  • Configure sFlow for traffic analysis
  • Set up automated configuration backups (Ansible)
  • Add redundant Brocade-Arista links (if needed)
  • Upgrade to 100Gb links (future-proofing)

Success Metrics

Technical Metrics

  • Bandwidth: 40Gb links achieve 35+ Gbps throughput
  • Latency: < 1ms RTT between hosts on Arista
  • MTU: 9000 byte frames work end-to-end
  • Availability: 99.9%+ uptime (LAG provides redundancy)
  • Ceph: Rebalance speed increased by 4x

Operational Metrics

  • Deployment Time: < 4 hours total
  • Downtime: < 5 minutes per host (during reconfiguration)
  • Documentation: Complete and accurate
  • Rollback: < 2 minutes to revert changes
  • Team Readiness: All staff trained on new configuration

References

Internal Documentation

Phase Documentation

Vendor Documentation

  • Brocade ICX6610 Command Reference
  • Arista 7050 Configuration Guide
  • Proxmox Network Configuration
  • Ceph Network Recommendations

Project History

Date         Milestone                            Status
2025-10-14   Project planning and documentation   ✅ Complete
TBD          Phase 1: Brocade-Arista LAG          📋 Ready
TBD          Phase 2: Brocade routing             📋 Ready
TBD          Phase 3: Arista MTU                  📋 Ready
TBD          Phase 4: Interface identification    📋 Ready
TBD          Phase 5: Proxmox reconfiguration     📋 Ready
TBD          Phase 6: Testing                     📋 Ready
TBD          Project completion                   ⏳ Pending

License and Credits

Documentation created by: Claude (Anthropic AI Assistant)
Date: October 14, 2025
Version: 1.0
Project: Dapper Cluster 40Gb Storage Network Upgrade

Contributors:

  • Network architecture review and validation
  • Configuration procedures and best practices
  • Testing methodology and verification procedures

Ready to begin? Start with Phase 1: Configure Brocade-Arista LAG

Monitoring Stack Gap Analysis - October 17, 2025

Executive Summary

A comprehensive review of the Grafana, Prometheus, and Loki monitoring stack found the core components functional, with 97.6% of Prometheus scrape targets healthy. The critical issues identified require both Kubernetes configuration changes and remediation of the external Ceph infrastructure.


Component Status

✅ Grafana (Healthy)

  • Status: Running (2/2 containers)
  • Memory: 441Mi
  • URL: grafana.chelonianlabs.com
  • Datasources: Properly configured
    • Prometheus: http://prometheus-operated.observability.svc.cluster.local:9090
    • Loki: http://loki-headless.observability.svc.cluster.local:3100
    • Alertmanager: http://alertmanager-operated.observability.svc.cluster.local:9093
  • Dashboards: 35+ configured and loading
  • Issues: None

✅ Prometheus (Healthy with Minor Issues)

  • Status: Running HA mode (2 replicas)
  • Memory: 2.1GB per pod
  • Scrape Success: 161/165 targets healthy (97.6%)
  • Storage: 5.8GB/100GB used (6%)
  • Retention: 14 days
  • Monitoring Coverage:
    • 38 ServiceMonitors
    • 7 PodMonitors
    • 44 PrometheusRules
  • Issues:
    • 4 targets down (2.4% failure rate)
    • Duplicate timestamp warnings from kube-state-metrics

⚠️ Loki (Functional but Dropping Logs)

  • Status: Running (2/2 containers)
  • Memory: 340Mi
  • Storage: 1.6GB/30GB used (5%)
  • Retention: 14 days
  • Log Collection: Successfully collecting from 17 namespaces
  • Issues:
    • CRITICAL: Max entry size limit (256KB) exceeded
    • Plex logs (553KB entries) being rejected
    • Error: Max entry size '262144' bytes exceeded

✅ Promtail (Healthy)

  • Status: DaemonSet running on all 11 nodes
  • Memory: 70-140Mi per pod
  • Target: http://loki-headless.observability.svc.cluster.local:3100/loki/api/v1/push
  • Issues: None (successfully shipping logs despite Loki rejections)

⚠️ Alertmanager (Healthy but Active Alerts)

  • Status: Running (2/2 containers)
  • Memory: 37Mi
  • Active Alerts: 19 alerts firing
  • Issues: See Active Alerts section below

Critical Issues

1. Loki Log Entry Size Limit

Severity: High
Impact: Logs from high-volume applications being dropped

Details:

  • Default max entry size: 262,144 bytes (256KB)
  • Plex application producing 553KB log entries
  • Logs silently dropped without alerting

Fix Applied:

  • ✅ Updated /kubernetes/apps/observability/loki/app/helmrelease.yaml
  • Added limits_config.max_line_size: 1048576 (1MB)
  • Action Required: Commit and push to trigger Flux reconciliation
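
For reference, the values change looks roughly like the following (a sketch only; the exact nesting depends on the Loki chart version in use):

# kubernetes/apps/observability/loki/app/helmrelease.yaml (sketch)
loki:
  limits_config:
    max_line_size: 1048576   # 1MB, up from the 256KB default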

Verification:

# After deployment, verify no more errors:
kubectl logs -n observability -l app.kubernetes.io/name=promtail --tail=100 | grep "exceeded"

2. External Ceph Cluster Health Warnings

Severity: High
Impact: PVC provisioning failures, pod scheduling blocked

Details: External Ceph cluster (running on Proxmox hosts) showing HEALTH_WARN:

  1. PG_AVAILABILITY (Critical):

    • 128 placement groups inactive
    • 128 placement groups incomplete
    • This is blocking new PVC creation
  2. MDS_SLOW_METADATA_IO:

    • 1 MDS (metadata server) reporting slow I/O
    • Impacts CephFS performance
  3. MDS_TRIM:

    • 1 MDS behind on trimming
    • Can impact metadata operations

Ceph Cluster Info:

  • FSID: 782dd297-215e-4c35-b7cf-659c20e6909e
  • Version: 18.2.7-0 (Reef)
  • Monitors: proxmox-02 (10.150.0.2), proxmox-03 (10.150.0.3), proxmox-04 (10.150.0.4)
  • Capacity: 195TB of 244TB available (~80% free)

Action Required: These are infrastructure-level issues that must be resolved on the Proxmox/Ceph cluster directly:

# SSH to Proxmox host and run:
ceph health detail
ceph pg dump | grep -E "inactive|incomplete"
ceph osd tree
ceph fs status cephfs

# Likely fixes (depending on root cause):
# - Check OSD status and bring up any down OSDs
# - Verify network connectivity between OSDs
# - Check disk space on OSD nodes
# - Review Ceph logs for specific PG issues

Kubernetes Impact:

  • ❌ Gatus pod stuck in Pending (PVC provisioning failure)
  • ❌ VolSync destination pods failing
  • ❌ Any new workloads requiring CephFS storage blocked

Prometheus Scrape Target Failures

Down Targets (4 total):

  1. athena.manor:9221 - Unnamed exporter (likely SNMP)
  2. circe.manor:9221 - Unnamed exporter (likely SNMP)
  3. nut-upsd.kube-system.svc.cluster.local:3493 - NUT UPS exporter
  4. zigbee-controller-garage.manor - Zigbee controller

Analysis: All down targets are edge devices or external services. Core Kubernetes monitoring intact.

Recommended Actions:

  • Verify network connectivity to .manor hostnames
  • Check if SNMP exporters are running
  • Investigate NUT UPS service in kube-system namespace
  • Verify zigbee-controller service status
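
To see exactly which targets Prometheus currently reports as down, the Prometheus HTTP API can be queried via a port-forward (a sketch):

kubectl -n observability port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | .scrapeUrl'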

Active Alerts (19 Total)

High Priority:

  1. TargetDown - Related to 4 targets listed above
  2. KubePodNotReady - Related to Ceph PVC provisioning issues (gatus, volsync)
  3. KubeDeploymentRolloutStuck - Likely gatus deployment
  4. KubePersistentVolumeFillingUp - Check which PVs

Medium Priority:

  1. CPUThrottlingHigh - Investigate which pods/namespaces
  2. KubeJobFailed - 2 failed jobs identified:
    • kometa-29344680 (media namespace)
    • plex-off-deck-29344620 (media namespace)
  3. VolSyncVolumeOutOfSync - Expected with current Ceph issues

Informational:

  1. Watchdog - Always firing (heartbeat)
  2. PrometheusDuplicateTimestamps - kube-state-metrics timing issue (low impact)

Recommendations

Immediate Actions (Required before further work):

  1. Loki configuration updated - Ready for commit
  2. ⚠️ Fix Ceph PG issues - Must be done on Proxmox hosts
  3. ⚠️ Verify Ceph health - Run ceph health detail on Proxmox

Post-Ceph Fix:

  1. Delete stuck pods to retry provisioning:

    kubectl delete pod -n observability gatus-6fcfb64bc8-zz996
    kubectl delete pod -n observability volsync-dst-gatus-dst-8wvtx
    
  2. Investigate and fix down Prometheus targets:

    • Check SNMP exporter configurations
    • Verify NUT UPS service
    • Test network connectivity to .manor devices
  3. Review CPU throttling alerts:

    kubectl top pods -A --sort-by=cpu
    # Adjust resource limits as needed
    
  4. Clean up failed CronJobs in media namespace

Long-term Improvements:

  • Add Loki ingestion metrics dashboard
  • Configure log sampling/filtering for high-volume apps
  • Set up PVC capacity monitoring alerts
  • Review and tune Prometheus scrape intervals
  • Consider adding CephFS-specific dashboards

Verification Checklist

After applying fixes:

  • Loki accepting large log entries (check Promtail logs)
  • No "exceeded" errors in Promtail logs
  • Ceph cluster shows HEALTH_OK
  • Gatus pod Running (2/2)
  • All PVCs Bound
  • Prometheus targets down count <= 2 (excluding optional edge devices)
  • Active alerts reduced to baseline (~5-10 expected)
  • All core namespace pods Running
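
A couple of quick spot checks for the storage-related items above (pod names follow earlier sections):

kubectl get pvc -A --no-headers | awk '$3 != "Bound"'    # any PVC not yet Bound
kubectl get pod -n observability | grep -i gatus         # should show Running 2/2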

Infrastructure Context

Deployment Method:

  • GitOps: FluxCD
  • Workflow: Edit repo → User commits → User pushes → Flux reconciles

Storage:

  • Provider: External Ceph cluster (Proxmox)
  • Storage Classes: cephfs-shared (default), cephfs-static
  • Provisioner: rook-ceph.cephfs.csi.ceph.com

Monitoring Namespace:

  • Namespace: observability
  • Components: Grafana, Prometheus (HA), Loki, Promtail, Alertmanager
  • Additional: VPA, Goldilocks, Gatus, Kromgo, various exporters

Next Steps

  1. User Action: Review and commit Loki configuration changes
  2. User Action: Fix Ceph PG availability issues on Proxmox
  3. After Ceph Fix: Proceed with pod cleanup and target investigations
  4. Monitor: Watch for new alerts or recurring issues

Generated: 2025-10-17
Analysis Duration: ~30 minutes
Status: Awaiting user commit and Ceph infrastructure remediation

Ceph RBD Storage Migration Candidates

Analysis performed: 2025-10-17

Overview

This document identifies workloads in the cluster that would benefit from migrating to ceph-rbd (Ceph block storage) instead of cephfs-shared (CephFS shared filesystem).

Key Principle: Databases, time-series stores, and stateful services requiring high I/O performance should use block storage (RBD). Shared files, media libraries, and backups should use filesystem storage (CephFS).


Current Status

Already Using ceph-rbd ✓

  • PostgreSQL (CloudNativePG) - 20Gi data + 5Gi WAL

Storage Classes Available

  • ceph-rbd - Block storage (RWO) - Best for databases
  • cephfs-shared - Shared filesystem (RWX) - Best for shared files/media
  • cephfs-static - Static CephFS volumes

Storage Configuration Patterns

Before migrating workloads, it's important to understand how PVCs are created in this cluster:

Pattern 1: Volsync Component Pattern (Most Apps)

Used by: 41+ applications including all media apps, self-hosted apps, home automation, AI apps

How it works:

  1. Application's ks.yaml includes the volsync component:

    components:
      - ../../../../flux/components/volsync
    
  2. PVC is created by the volsync component template (flux/components/volsync/pvc.yaml)

  3. Storage configuration is set via postBuild.substitute in the ks.yaml:

    postBuild:
      substitute:
        APP: prowlarr
        VOLSYNC_CAPACITY: 5Gi
        VOLSYNC_STORAGECLASS: cephfs-shared      # Default if not specified
        VOLSYNC_ACCESSMODES: ReadWriteMany       # Default if not specified
        VOLSYNC_SNAPSHOTCLASS: cephfs-snapshot   # Default if not specified
    

Default values:

  • Storage Class: cephfs-shared
  • Access Modes: ReadWriteMany
  • Snapshot Class: cephfs-snapshot

Examples:

  • Prowlarr: kubernetes/apps/media/prowlarr/ks.yaml
  • Obsidian CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml
  • Most workloads with < 100Gi storage needs

Pattern 2: Direct HelmRelease Pattern

Used by: Large observability workloads (Prometheus, Loki, AlertManager)

How it works:

  1. Storage is defined directly in the HelmRelease values
  2. No volsync component used
  3. PVC created by Helm chart templates

Example (Prometheus):

# kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: cephfs-shared
          resources:
            requests:
              storage: 100Gi

Examples:

  • Prometheus: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
  • Loki: kubernetes/apps/observability/loki/app/helmrelease.yaml
  • AlertManager: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml

Migration Candidates

🔴 HIGH Priority - Data Durability Risk

1. Dragonfly Redis

  • Namespace: database
  • Current Storage: NONE (ephemeral, in-memory only)
  • Current Size: N/A (data lost on restart)
  • Replicas: 3
  • Recommended: Add ceph-rbd PVCs (~10Gi each for snapshots/persistence)
  • Why: Redis alternative running in cluster mode needs persistent snapshots for:
    • Data durability across restarts
    • Cluster state recovery
    • Snapshot-based backups
  • Impact: HIGH - Currently losing all data on pod restart
  • Config Location: kubernetes/apps/database/dragonfly-redis/cluster/cluster.yaml
  • Migration Complexity: Medium - requires modifying Dragonfly CRD to add volumeClaimTemplates

2. EMQX MQTT Broker

  • Namespace: database
  • Current Storage: NONE (emptyDir, ephemeral)
  • Current Size: N/A (data lost on restart)
  • Replicas: 3 (StatefulSet)
  • Recommended: Add ceph-rbd PVCs (~5-10Gi each for session/message persistence)
  • Why: MQTT brokers need persistent storage for:
    • Retained messages
    • Client subscriptions
    • Session state for QoS > 0
    • Cluster configuration
  • Impact: HIGH - Currently losing retained messages and sessions on restart
  • Config Location: kubernetes/apps/database/emqx/cluster/cluster.yaml
  • Migration Complexity: Medium - requires modifying EMQX CRD to add persistent volumes

🟡 MEDIUM Priority - Performance & Best Practices

3. CouchDB (obsidian-couchdb)

  • Namespace: selfhosted
  • Current Storage: cephfs-shared
  • Current Size: 5Gi
  • Replicas: 1 (Deployment)
  • Storage Pattern: Volsync Component (kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: NoSQL database benefits from:
    • Better I/O performance for document reads/writes
    • Improved fsync performance for data integrity
    • Block-level snapshots for consistent backups
  • Impact: Medium - requires backup, PVC migration, restore
  • Migration Complexity: Medium - GitOps workflow with volsync pattern
    • Update ks.yaml postBuild substitutions
    • Commit and push changes
    • Flux recreates PVC with new storage class
    • Volsync handles data restoration

4. Prometheus

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 2x100Gi (200Gi total across 2 replicas)
  • Replicas: 2 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Time-series database with:
    • Heavy write workload (constant metric ingestion)
    • Random read patterns for queries
    • Significant performance gains with block storage
    • Better compaction performance
  • Impact: HIGH - Largest performance improvement opportunity
  • Migration Complexity: High
    • Large data volume (200Gi total)
    • Update HelmRelease volumeClaimTemplate.spec.storageClassName
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Consider data retention during migration

5. Loki

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 30Gi
  • Replicas: 1 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/loki/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Log aggregation database benefits from:
    • Better write performance for high-volume log ingestion
    • Improved compaction and chunk management
    • Block storage better suited for LSM-tree based storage
  • Impact: Medium - noticeable improvement in log write performance
  • Migration Complexity: Medium
    • Moderate data size
    • Update HelmRelease singleBinary.persistence.storageClass
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Can tolerate some log loss during migration

6. AlertManager

  • Namespace: observability
  • Current Storage: cephfs-shared
  • Current Size: 2Gi
  • Replicas: 1 (StatefulSet)
  • Storage Pattern: 🔧 Direct HelmRelease (kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml)
  • Recommended: Migrate to ceph-rbd
  • Why: Alert state persistence benefits from:
    • Consistent snapshot capabilities
    • Better fsync performance for state writes
  • Impact: Low - small storage footprint, quick migration
  • Migration Complexity: Low
    • Small data size
    • Update HelmRelease alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName
    • Commit and push changes
    • Flux recreates StatefulSet with new storage
    • Minimal downtime

What Should Stay on CephFS

The following workloads are correctly using CephFS and should NOT be migrated:

Media & Shared Files (RWX Access Required)

  • Media libraries (Plex, Sonarr, Radarr, etc.) - Need shared filesystem access
  • AI models (Ollama 100Gi) - Large files with potential shared access
  • Application configs - Often need shared access across pods

Backup Storage

  • Volsync repositories (cephfs-static) - Restic repositories work well on filesystem
  • MinIO data (cephfs-static, 10Ti) - Object storage on filesystem is appropriate

Other

  • OpenEBS etcd/minio - Already using local PVs (mayastor-etcd-localpv, openebs-minio-localpv)
  • Runner work volumes - Ephemeral workload storage

Migration Summary

Total Storage to Migrate

  • Dragonfly: +30Gi (3 replicas x 10Gi) - NEW storage
  • EMQX: +15-30Gi (3 replicas x 5-10Gi) - NEW storage
  • CouchDB: 5Gi (migrate from cephfs)
  • Prometheus: 200Gi (migrate from cephfs)
  • Loki: 30Gi (migrate from cephfs)
  • AlertManager: 2Gi (migrate from cephfs)

Total New ceph-rbd Needed: ~282-297Gi
Currently Migrating from CephFS: ~237Gi

Recommended Migration Order:

  1. Phase 0: Validation (Test the process)

    • AlertManager - LOW RISK test case to validate GitOps workflow
  2. Phase 1: Data Durability (Immediate)

    • Dragonfly - Add persistent storage
    • EMQX - Add persistent storage
  3. Phase 2: Small Databases (Quick Wins)

    • CouchDB - Medium complexity, important for Obsidian data
  4. Phase 3: Large Time-Series DBs (Performance)

    • Loki - Medium size, good performance gains
    • Prometheus - Large size, significant performance gains

Migration Checklists

Phase 0: AlertManager Migration (Validation Test)

Goal: Validate the GitOps migration workflow with a low-risk workload

Pre-Migration Checklist:

  • Verify current AlertManager state
    kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
    kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
    kubectl describe pvc -n observability alertmanager-kube-prometheus-stack-alertmanager-db-alertmanager-kube-prometheus-stack-alertmanager-0 | grep "StorageClass:"
    
  • Check current storage usage
    kubectl exec -n observability alertmanager-kube-prometheus-stack-alertmanager-0 -- df -h /alertmanager
    
  • Document current alerts (optional - state will rebuild)
    kubectl get prometheusrule -A
    
  • Verify ceph-rbd storage class exists
    kubectl get storageclass ceph-rbd
    kubectl get volumesnapshotclass ceph-rbd-snapshot
    

Migration Steps:

  • Create feature branch
    git checkout -b feat/alertmanager-rbd-migration
    
  • Update HelmRelease configuration
    • File: kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    • Change: alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.storageClassName: ceph-rbd
    • Line: ~104 (search for alertmanager storageClassName)
  • Commit changes
    git add kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    git commit -m "feat(alertmanager): migrate to ceph-rbd storage"
    
  • Push to remote
    git push origin feat/alertmanager-rbd-migration
    
  • Monitor Flux reconciliation
    flux reconcile kustomization kube-prometheus-stack -n observability --with-source
    watch kubectl get pods -n observability -l app.kubernetes.io/name=alertmanager
    
  • Verify new PVC created with ceph-rbd
    kubectl get pvc -n observability -l app.kubernetes.io/name=alertmanager
    kubectl describe pvc -n observability <new-pvc-name> | grep "StorageClass:"
    
  • Verify AlertManager is running
    kubectl get pod -n observability -l app.kubernetes.io/name=alertmanager
    kubectl logs -n observability -l app.kubernetes.io/name=alertmanager --tail=50
    
  • Check AlertManager UI (https://alertmanager.${SECRET_DOMAIN})
    • UI loads successfully
    • Alerts are being received
    • Silences can be created
  • Wait 24 hours to verify stability
  • Merge to main
    git checkout main
    git merge feat/alertmanager-rbd-migration
    git push origin main
    

Post-Migration Validation:

  • Verify old PVC is deleted (should happen automatically)
    kubectl get pvc -A | grep alertmanager
    
  • Check Ceph RBD usage
    kubectl exec -n rook-ceph deploy/rook-ceph-tools -- ceph df
    
  • Document lessons learned for larger migrations
  • Update this checklist with any issues encountered

Rollback Plan (if needed):

  • Revert the commit
    git revert HEAD
    git push origin main
    
  • Flux will recreate AlertManager with cephfs-shared
  • Alert state will rebuild (acceptable data loss)

Migration Procedures

Pattern 1: Volsync Component Apps (GitOps Workflow)

Used for: CouchDB, and any app using the volsync component

Steps:

  1. Update ks.yaml - Add storage class overrides to postBuild.substitute:

    postBuild:
      substitute:
        APP: obsidian-couchdb
        VOLSYNC_CAPACITY: 5Gi
        VOLSYNC_STORAGECLASS: ceph-rbd              # Changed from default
        VOLSYNC_ACCESSMODES: ReadWriteOnce          # Changed from ReadWriteMany
        VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot    # Changed from cephfs-snapshot
        VOLSYNC_CACHE_STORAGECLASS: ceph-rbd        # For volsync cache
        VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce    # For volsync cache
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically:

    • Flux detects the change in Git
    • Recreates the PVC with new storage class
    • Volsync ReplicationDestination restores data from backup
    • Application pod starts with new RBD-backed storage
  4. Verify the application is running correctly with new storage:

    kubectl get pvc -n <namespace> <app>
    kubectl describe pvc -n <namespace> <app> | grep StorageClass
    

Example files:

  • CouchDB: kubernetes/apps/selfhosted/obsidian-couchdb/ks.yaml

Pattern 2: Direct HelmRelease Apps (GitOps Workflow)

Used for: Prometheus, Loki, AlertManager

Steps:

For Prometheus & AlertManager:

  1. Update helmrelease.yaml - Change storageClassName in volumeClaimTemplate:

    # kubernetes/apps/observability/kube-prometheus-stack/app/helmrelease.yaml
    prometheus:
      prometheusSpec:
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd  # Changed from cephfs-shared
              resources:
                requests:
                  storage: 100Gi
    
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: ceph-rbd  # Changed from cephfs-shared
              resources:
                requests:
                  storage: 2Gi
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically:

    • Flux detects the HelmRelease change
    • Helm recreates the StatefulSet
    • New PVCs created with ceph-rbd storage class
    • Pods start with new storage (data loss acceptable for metrics/alerts)

For Loki:

  1. Update helmrelease.yaml - Change storageClass in persistence config:

    # kubernetes/apps/observability/loki/app/helmrelease.yaml
    singleBinary:
      persistence:
        enabled: true
        storageClass: ceph-rbd  # Changed from cephfs-shared
        size: 30Gi
    
  2. Commit and push changes to Git repository

  3. Flux reconciles automatically - Same process as Prometheus

Note: For observability workloads, some data loss during migration is typically acceptable since:

  • Prometheus has 14d retention - new data will accumulate
  • Loki has 14d retention - new logs will accumulate
  • AlertManager state is ephemeral and will rebuild

For Services Without Storage (Dragonfly, EMQX)

Steps:

  1. Update CRD to add volumeClaimTemplates with ceph-rbd
  2. Commit and push changes
  3. Flux recreates StatefulSet with persistent storage
  4. Configure volsync backup strategy (optional)

Important Migration Considerations

Snapshot Class Compatibility

When migrating from CephFS to Ceph RBD, snapshot classes must match the storage backend:

Storage Class   Compatible Snapshot Class
cephfs-shared   cephfs-snapshot
ceph-rbd        ceph-rbd-snapshot

Why this matters:

  • Volsync uses snapshots for backup/restore operations
  • Using the wrong snapshot class will cause volsync to fail
  • Both the main storage and cache storage need matching snapshot classes

Available VolumeSnapshotClasses in cluster:

$ kubectl get volumesnapshotclass
NAME                DRIVER                          DELETIONPOLICY
ceph-rbd-snapshot   rook-ceph.rbd.csi.ceph.com      Delete
cephfs-snapshot     rook-ceph.cephfs.csi.ceph.com   Delete
csi-nfs-snapclass   nfs.csi.k8s.io                  Delete

Access Mode Changes

Storage Type             Access Mode           Use Case
CephFS (cephfs-shared)   ReadWriteMany (RWX)   Shared filesystems, media libraries
Ceph RBD (ceph-rbd)      ReadWriteOnce (RWO)   Databases, block storage

Impact:

  • RBD volumes can only be mounted by one node at a time
  • Applications must be single-replica or use StatefulSet with pod affinity
  • Most database workloads already use RWO - minimal impact
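
Before flipping a workload to RWO, it can help to list the current storage classes and access modes of all PVCs (a sketch):

kubectl get pvc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,ACCESSMODES:.spec.accessModes'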

Volsync Cache Storage

When using volsync with RBD, both the main storage and cache storage should use RBD:

postBuild:
  substitute:
    # Main PVC settings
    VOLSYNC_STORAGECLASS: ceph-rbd
    VOLSYNC_ACCESSMODES: ReadWriteOnce
    VOLSYNC_SNAPSHOTCLASS: ceph-rbd-snapshot

    # Cache PVC settings (must also match RBD)
    VOLSYNC_CACHE_STORAGECLASS: ceph-rbd
    VOLSYNC_CACHE_ACCESSMODES: ReadWriteOnce
    VOLSYNC_CACHE_CAPACITY: 10Gi

Why? Mixing CephFS cache with RBD main storage can cause:

  • Snapshot compatibility issues
  • Performance inconsistencies
  • Backup/restore failures

Technical Notes

  • Ceph RBD Pool: Backed by rook-pvc-pool
  • Storage Class: ceph-rbd
  • Access Mode: RWO (ReadWriteOnce) - single node access
  • Features: Volume expansion enabled, snapshot support
  • Reclaim Policy: Delete
  • CSI Driver: rook-ceph.rbd.csi.ceph.com
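
These properties can be confirmed directly against the storage class object (a sketch):

kubectl get storageclass ceph-rbd -o jsonpath='{.provisioner}{" expansion="}{.allowVolumeExpansion}{" reclaim="}{.reclaimPolicy}{"\n"}'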

References

  • Current cluster storage: kubernetes/apps/storage/
  • Database configs: kubernetes/apps/database/*/cluster/cluster.yaml
  • Storage class definition: Managed by Rook operator

VPA-Based Resource Limit Updates

Summary

This document outlines a plan to systematically update resource limits across the cluster based on VPA (Vertical Pod Autoscaler) recommendations from Goldilocks to eliminate CPU throttling alerts.

Changes Already Made

1. Alert Configuration

File: kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml

  • Changed default receiver from pushover to "null"
  • Added explicit routes for severity: warning and severity: critical to pushover
  • Result: Only critical and warning alerts will trigger pushover notifications (no more info-level spam)
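
The routing section of that AlertmanagerConfig ends up looking roughly like this (a sketch only; receiver names and matcher syntax should be checked against the actual file):

# kubernetes/apps/observability/kube-prometheus-stack/app/alertmanagerconfig.yaml (sketch)
spec:
  route:
    receiver: "null"            # default: info-level alerts go nowhere
    routes:
      - receiver: pushover
        matchers:
          - name: severity
            value: warning
      - receiver: pushover
        matchers:
          - name: severity
            value: critical
  receivers:
    - name: "null"
    - name: pushover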

2. Promtail Resources

File: kubernetes/apps/observability/promtail/app/helmrelease.yaml

  • CPU Request: 50m → 100m
  • CPU Limit: 100m → 250m
  • Rationale: VPA recommends 101m upper bound, but we added headroom for log bursts

Priority Workloads for Update

High Priority (Currently Throttling or at Risk)

Observability Namespace

  1. Loki - Log aggregation

    • Current: cpu: 35m request, 200m limit
    • VPA: cpu: 23m request, 140m limit
    • Action: Keep current limits (already adequate)
  2. Grafana - Visualization

    • Current: No CPU limits
    • VPA: cpu: 63m request, 213m limit
    • Action: Add limits - 100m request, 500m limit for burst capacity
  3. Internal Nginx Ingress (network namespace)

    • Current: cpu: 500m request, no limit
    • VPA: cpu: 63m request, 316m limit
    • Action: Add 500m limit (keep generous for traffic spikes)

Medium Priority (Good to standardize)

Observability Namespace

  1. kube-state-metrics

    • VPA: cpu: 23m request, 77m limit
    • Action: Add resources block
  2. Goldilocks Controller

    • VPA: cpu: 587m request, 2268m limit (!)
    • Action: Add generous limits for this workload
  3. Blackbox Exporter

    • VPA: cpu: 15m request, 37m limit
    • Action: Add resources block

Network Namespace

  1. External Nginx Ingress

    • VPA: cpu: 49m request, 165m limit
    • Action: Add resources block
  2. Cloudflared

    • VPA: cpu: 15m request, 214m limit
    • Action: Add resources block (note the high burst)

Low Priority (Already well-configured)

  • Node Exporter: Current limits are generous (250m limit vs 22m VPA)
  • DCGM Exporter: Has limits, VPA shows adequate
  • Media workloads: Most have no CPU limits (intentional for high CPU apps like Plex, Bazarr)

Implementation Strategy

Phase 1: Stop the Alerts (DONE ✅)

  • Update alertmanagerconfig to filter by severity
  • Update promtail CPU limits

Phase 2: Observability Namespace (Next)

Update these critical monitoring components:

  • Grafana - Add CPU limits
  • kube-state-metrics - Add resources
  • Goldilocks controller - Add resources
  • Blackbox exporter - Add resources

Phase 3: Network Infrastructure

  • Internal nginx ingress - Add CPU limit
  • External nginx ingress - Add resources
  • Cloudflared - Add resources

Phase 4: Optional Refinements

  • Review VPA recommendations quarterly
  • Adjust limits based on actual usage patterns
  • Consider enabling VPA auto-mode for non-critical workloads

How to Use VPA Recommendations

1. View All Recommendations

# Run the helper script
./scripts/vpa-resource-recommendations.sh

# Or visit the dashboard
open https://goldilocks.chelonianlabs.com

2. Get Specific Workload Recommendations

kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{.status.recommendation.containerRecommendations[0]}' | jq

3. Update HelmRelease

Add a resources block under values:

values:
  resources:
    requests:
      cpu: <vpa_target>
      memory: <vpa_target_memory>
    limits:
      cpu: <vpa_upper_or_2x_for_bursts>
      memory: <vpa_upper_memory>
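
As a concrete instance, Grafana with the VPA figures quoted earlier (63m target, 213m upper bound) plus the recommended headroom might look like this; the memory values here are placeholders, not VPA output:

values:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi   # placeholder - use the VPA memory target
    limits:
      cpu: 500m
      memory: 512Mi   # placeholder - use the VPA memory upper bound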

4. Apply and Monitor

# Commit changes
git add kubernetes/apps/observability/grafana/app/helmrelease.yaml
git commit -m "feat(grafana): add CPU limits based on VPA recommendations"
git push

# Force reconciliation (optional)
flux reconcile helmrelease -n observability grafana

# Monitor for throttling
kubectl top pods -n observability --containers

VPA Interpretation Guide

VPA Recommendation Fields:

  • target: Use as your request value
  • lowerBound: Minimum to function
  • upperBound: Use as limit (or higher for burst workloads)
  • uncappedTarget: What VPA thinks is ideal without constraints
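
To pull these fields for every container in a workload's VPA object (example VPA name from the command above):

kubectl get vpa -n observability goldilocks-grafana -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{": target="}{.target.cpu}{", upper="}{.upperBound.cpu}{"\n"}{end}'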

When to Deviate:

  • Burst workloads (logs, ingress): Use 2-3x upper bound for limits
  • Background jobs: Match VPA recommendations closely
  • User-facing apps: Add 50-100% headroom for traffic spikes
  • Resource-constrained: Start with target, monitor, then adjust

Monitoring for Success

After updates, verify alerts have stopped:

# Check for firing CPU throttling alerts via the Alertmanager API (port-forward first)
kubectl -n observability port-forward svc/alertmanager-operated 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname' | grep -i throttl

# Check actual CPU usage vs limits
kubectl top pods -A --containers | sort -k4 -h -r | head -20

# Review VPA over time
watch kubectl get vpa -n observability

Tools Created

  1. scripts/vpa-resource-recommendations.sh - Extract VPA recommendations with HelmRelease file locations
  2. This document - Implementation plan and guidance

References