Dapper Cluster Documentation
This documentation covers the architecture, configuration, and operations of the Dapper Kubernetes cluster, a high-performance home lab infrastructure with GPU capabilities.
Cluster Overview
graph TD
  subgraph ControlPlane["Control Plane"]
    CP1[Control Plane 1<br>4 CPU, 16GB]
    CP2[Control Plane 2<br>4 CPU, 16GB]
    CP3[Control Plane 3<br>4 CPU, 16GB]
  end
  subgraph Workers["Worker Nodes"]
    W1[Worker 1<br>16 CPU, 128GB]
    W2[Worker 2<br>16 CPU, 128GB]
    GPU[GPU Node<br>16 CPU, 128GB<br>4x Tesla P100]
  end
  CP1 --- CP2
  CP2 --- CP3
  CP3 --- CP1
  ControlPlane --> Workers
Hardware Specifications
Control Plane
- 3 nodes for high availability
- 4 CPU cores per node
- 16GB RAM per node
- Dedicated to cluster control plane operations
Worker Nodes
- 2 general-purpose worker nodes
- 16 CPU cores per node
- 128GB RAM per node
- Handles general workloads and applications
GPU Node
- Specialized GPU worker node
- 16 CPU cores
- 128GB RAM
- 4x NVIDIA Tesla P100 GPUs
- Handles ML/AI and GPU-accelerated workloads
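To consume these GPUs, pods request the extended resource exposed by the NVIDIA device plugin. A minimal smoke-test sketch, assuming the device plugin (or GPU Operator) is installed; the image tag is illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one of the four Tesla P100s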
Key Features
- High-availability Kubernetes cluster
- GPU acceleration support
- Automated deployment using Flux CD
- Secure secrets management with SOPS
- NFS and OpenEBS storage integration
- Comprehensive monitoring and observability
- Media services automation
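Most of these features are delivered declaratively: Flux CD watches the Git repository and reconciles the manifests it finds there. A hedged sketch of that pattern (the repository URL and path are placeholders):
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: dapper-cluster
  namespace: flux-system
spec:
  interval: 10m
  url: https://github.com/username/dapper-cluster   # placeholder, as in the installation guide
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./kubernetes/apps   # hypothetical path within the repository
  prune: true
  sourceRef:
    kind: GitRepository
    name: dapper-cluster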
Infrastructure Components
graph TD
  subgraph Core["Core Services"]
    Flux[Flux CD]
    Storage[Storage Layer]
    Network[Network Layer]
  end
  subgraph Apps["Applications"]
    Media[Media Stack]
    Monitor[Monitoring]
    GPU[GPU Workloads]
  end
  Core --> Apps
  Storage --> |NFS/OpenEBS| Apps
  Network --> |Ingress/DNS| Apps
Documentation Structure
- Architecture: Detailed technical documentation about cluster design and components
- High-availability control plane design
- Storage architecture and configuration
- Network topology and policies
- GPU integration and management
- Applications: Information about deployed applications and their configurations
- Media services stack
- Monitoring and observability
- GPU-accelerated applications
- Operations: Guides for installation, maintenance, and troubleshooting
- Cluster setup procedures
- Node management
- GPU configuration
- Maintenance tasks
Getting Started
For new users, we recommend starting with:
- Architecture Overview - Understanding the cluster design
- Installation Guide - Setting up the cluster
- Application Stack - Deploying applications
Quick Links
Architecture Overview
Cluster Architecture
graph TD
  subgraph ControlPlane["Control Plane"]
    CP1[Control Plane 1<br>4 CPU, 16GB]
    CP2[Control Plane 2<br>4 CPU, 16GB]
    CP3[Control Plane 3<br>4 CPU, 16GB]
    CP1 --- CP2
    CP2 --- CP3
    CP3 --- CP1
  end
  subgraph Workers["Worker Nodes"]
    W1[Worker 1<br>16 CPU, 128GB]
    W2[Worker 2<br>16 CPU, 128GB]
  end
  subgraph GPUNode["GPU Node"]
    GPU[GPU Worker<br>16 CPU, 128GB<br>4x Tesla P100]
  end
  ControlPlane --> Workers
  ControlPlane --> GPUNode
Core Components
Control Plane
- High Availability: 3-node control plane configuration
- Resource Allocation: 4 CPU, 16GB RAM per node
- Components:
- etcd cluster
- API Server
- Controller Manager
- Scheduler
Worker Nodes
- General Purpose Workers: 2 nodes
- Resources per Node:
- 16 CPU cores
- 128GB RAM
- Workload Types:
- Application deployments
- Database workloads
- Media services
- Monitoring systems
GPU Node
- Specialized Worker: 1 node
- Hardware:
- 16 CPU cores
- 128GB RAM
- 4x NVIDIA Tesla P100 GPUs
- Workload Types:
- ML/AI workloads
- Video transcoding
- GPU-accelerated applications
Network Architecture
graph TD
  subgraph External
    Internet((Internet))
    DNS((DNS))
  end
  subgraph Edge["Network Edge"]
    FW[Firewall]
    LB[Load Balancer]
  end
  subgraph K8s["Kubernetes Network"]
    CP[Control Plane]
    Workers[Worker Nodes]
    GPUNode[GPU Node]
    subgraph Services
      Ingress[Ingress Controller]
      CoreDNS[CoreDNS]
      CNI[Network Plugin]
    end
  end
  Internet --> FW
  DNS --> FW
  FW --> LB
  LB --> CP
  CP --> Workers
  CP --> GPUNode
  Services --> Workers
  Services --> GPUNode
Storage Architecture
graph TD
  subgraph Classes["Storage Classes"]
    NFS[NFS Storage Class]
    OpenEBS[OpenEBS Storage Class]
  end
  subgraph PVs["Persistent Volumes"]
    NFS --> NFS_PV[NFS PVs]
    OpenEBS --> Local_PV[Local PVs]
  end
  subgraph Workloads["Workload Types"]
    NFS_PV --> Media[Media Storage]
    NFS_PV --> Shared[Shared Config]
    Local_PV --> DB[Databases]
    Local_PV --> Cache[Cache Storage]
  end
Security Considerations
- Network segmentation using Kubernetes network policies
- Encrypted secrets management with SOPS
- TLS encryption for all external services
- Regular security updates via automated pipelines
- GPU access controls and resource quotas
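GPU access can be limited per namespace with a standard ResourceQuota on the extended resource. A minimal sketch; the namespace and limit are illustrative:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads   # hypothetical namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"   # cap GPU requests for this namespace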
Scalability
The cluster architecture is designed to be scalable:
- High-availability control plane (3 nodes)
- Expandable worker node pool
- Specialized GPU node for compute-intensive tasks
- Dynamic storage provisioning
- Load balancing for external services
- Resource quotas and limits management
Monitoring and Observability
graph LR
  subgraph Stack["Monitoring Stack"]
    Prom[Prometheus]
    Graf[Grafana]
    Alert[Alertmanager]
  end
  subgraph Nodes["Node Types"]
    CP[Control Plane Metrics]
    Work[Worker Metrics]
    GPU[GPU Metrics]
  end
  CP --> Prom
  Work --> Prom
  GPU --> Prom
  Prom --> Graf
  Prom --> Alert
Resource Management
Control Plane
- Reserved for Kubernetes control plane components
- Optimized for control plane operations
- High availability configuration
Worker Nodes
- General purpose workloads
- Balanced resource allocation
- Flexible scheduling options
GPU Node
- Dedicated for GPU workloads
- NVIDIA GPU operator integration
- Specialized resource scheduling
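One common way to keep general workloads off the GPU node is a taint plus a matching toleration and node selector on GPU workloads. A hedged pod-spec fragment; the taint key and node label are assumptions, not values confirmed for this cluster:
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # label typically applied by the GPU Operator's feature discovery
  tolerations:
    - key: nvidia.com/gpu            # assumes the GPU node is tainted with this key
      operator: Exists
      effect: NoSchedule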
Network Architecture
Network Overview
graph TD
  subgraph External
    Internet((Internet))
    DNS((DNS))
  end
  subgraph Edge["Network Edge"]
    FW[Firewall]
    LB[Load Balancer]
    Internet --> FW
    DNS --> FW
    FW --> LB
  end
  subgraph K8s["Kubernetes Network"]
    subgraph Ingress
      LB --> Traefik[Traefik]
      Traefik --> Services[Internal Services]
    end
    subgraph Policies["Network Policies"]
      Services --> Apps[Applications]
      Services --> DBs[Databases]
    end
    subgraph CNI["Container Network"]
      Apps --> Pod1[Pod Network]
      DBs --> Pod1
    end
  end
Components
Ingress Controller
- Traefik: Main ingress controller
- SSL/TLS termination
- Automatic certificate management
- Route configuration
- Load balancing
Network Policies
graph LR
  subgraph Policies
    Default[Default Deny]
    Allow[Allowed Routes]
  end
  subgraph Apps
    Media[Media Stack]
    Monitor[Monitoring]
    DB[Databases]
  end
  Allow --> Media
  Allow --> Monitor
  Default --> DB
DNS Configuration
- External DNS for automatic DNS management
- Internal DNS resolution
- Split DNS configuration
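For the external side, external-dns typically creates records from annotations on services or ingresses. A minimal sketch; the hostname is a placeholder:
apiVersion: v1
kind: Service
metadata:
  name: example-app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com   # placeholder hostname
spec:
  type: LoadBalancer
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080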
Security
Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
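With the default deny in place, traffic must be allowed explicitly. A hedged companion policy admitting ingress from the namespace that runs Traefik; the namespace name is illustrative:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: network   # hypothetical namespace hosting Traefik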
TLS Configuration
- Automatic certificate management via cert-manager
- Let's Encrypt integration
- Internal PKI for service mesh
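A hedged example of the cert-manager issuer referenced by ingress annotations later in this documentation; the contact email is a placeholder:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com   # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: traefik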
Service Mesh
Traffic Flow
graph LR
  subgraph Ingress
    External[External Traffic]
    Traefik[Traefik]
  end
  subgraph Services
    App1[Service 1]
    App2[Service 2]
    DB[Database]
  end
  External --> Traefik
  Traefik --> App1
  Traefik --> App2
  App1 --> DB
  App2 --> DB
Best Practices
- Security
- Implement default deny policies
- Use TLS everywhere
- Regular security audits
- Network segmentation
- Performance
- Load balancer optimization
- Connection pooling
- Proper resource allocation
- Traffic monitoring
- Reliability
- High availability configuration
- Failover planning
- Backup routes
- Health checks
- Monitoring
- Network metrics collection
- Traffic analysis
- Latency monitoring
- Bandwidth usage tracking
Troubleshooting
Common network issues and resolution steps:
- Connectivity Issues
- Check network policies
- Verify DNS resolution
- Inspect service endpoints
- Review ingress configuration
- Performance Problems
- Monitor network metrics
- Check for bottlenecks
- Analyze traffic patterns
- Review resource allocation
Storage Architecture
Storage Overview
graph TD
  subgraph Classes["Storage Classes"]
    NFS[NFS Storage Class]
    OpenEBS[OpenEBS Storage Class]
  end
  subgraph PVs["Persistent Volumes"]
    NFS --> NFS_PV[NFS PVs]
    OpenEBS --> Local_PV[Local PVs]
  end
  subgraph Applications
    NFS_PV --> Media[Media Apps]
    NFS_PV --> Backup[Backup Storage]
    Local_PV --> Database[Databases]
    Local_PV --> Cache[Cache Storage]
  end
Storage Classes
NFS Storage Class
- Used for shared storage across nodes
- Ideal for media storage and shared configurations
- Supports ReadWriteMany (RWX) access mode
- Configuration:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server
  share: /export/nfs
OpenEBS Storage Class
- Local storage for performance-critical applications
- Used for databases and caching layers
- Supports ReadWriteOnce (RWO) access mode
- Configuration:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-local
provisioner: openebs.io/local
volumeBindingMode: WaitForFirstConsumer
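For reference, a claim against this class might look like the sketch below (name and size are illustrative). Because of WaitForFirstConsumer, the volume is only provisioned once a pod using the claim is scheduled onto a specific node:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data   # hypothetical database volume
spec:
  accessModes:
    - ReadWriteOnce   # local volumes are bound to a single node
  storageClassName: openebs-local
  resources:
    requests:
      storage: 50Gi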
Storage Considerations
Performance
- Use OpenEBS local storage for:
- Databases requiring low latency
- Cache storage
- Write-intensive workloads
- Use NFS storage for:
- Media files
- Shared configurations
- Backup storage
- Read-intensive workloads
Backup Strategy
graph LR
  Apps[Applications] --> PV[Persistent Volumes]
  PV --> Backup[Backup Jobs]
  Backup --> Remote[Remote Storage]
Volume Snapshots
- Regular snapshots for data protection
- Snapshot schedules based on data criticality
- Retention policies for space management
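If the CSI driver in use supports snapshots and a VolumeSnapshotClass is installed, a scheduled job can create snapshots like the following sketch; the class and claim names are placeholders:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: postgres-data   # hypothetical claim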
Best Practices
- Storage Class Selection
- Use appropriate storage class based on workload requirements
- Consider access modes needed by applications
- Account for performance requirements
- Resource Management
- Set appropriate storage quotas
- Monitor storage usage
- Plan for capacity expansion
- Data Protection
- Regular backups
- Snapshot scheduling
- Replication where needed
- Performance Optimization
- Use local storage for performance-critical workloads
- Implement caching strategies
- Monitor I/O patterns
Media Applications
Media Stack Overview
graph TD
  subgraph Management["Media Management"]
    Wizarr[Wizarr]
    Plex[Plex Media Server]
    Wizarr --> Plex
  end
  subgraph Storage
    NFS[NFS Storage]
    Plex --> NFS
  end
  subgraph Access["Access Control"]
    Auth[Authentication]
    Wizarr --> Auth
  end
Components
Media Server
- Plex Media Server
- Media streaming service
- Transcoding capabilities
- Library management
- Multi-user support
User Management
- Wizarr
- User invitation system
- Plex account management
- Access control
- Integration with authentication
Storage Configuration
Media Storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-storage
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 1Ti
Network Configuration
Service Configuration
- Internal service discovery
- External access through ingress
- Secure connections with TLS
Example Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: media-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: plex.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: plex
                port:
                  number: 32400
Resource Management
Resource Allocation
- CPU and memory limits
- Storage quotas
- Network bandwidth considerations
Example Resource Configuration
resources:
  limits:
    cpu: "4"
    memory: 8Gi
  requests:
    cpu: "2"
    memory: 4Gi
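Because transcoding can run on the GPU node, the media server can also request a GPU. A hedged variant, assuming the NVIDIA device plugin exposes nvidia.com/gpu:
resources:
  limits:
    cpu: "4"
    memory: 8Gi
    nvidia.com/gpu: 1   # hardware transcoding; requires the NVIDIA device plugin on the GPU node
  requests:
    cpu: "2"
    memory: 4Gi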
Maintenance
Backup Strategy
graph LR
  Media[Media Files] --> Backup[Backup Storage]
  Config[Configurations] --> Backup
  Meta[Metadata] --> Backup
Regular Tasks
- Database backups
- Configuration backups
- Media library scans
- Storage cleanup
Monitoring
Key Metrics
- Server health
- Transcoding performance
- Storage usage
- Network bandwidth
- User activity
Alerts
- Storage capacity warnings
- Service availability
- Performance degradation
- Failed transcoding jobs
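If the cluster's Prometheus stack uses the Prometheus Operator, alerts like these can be expressed as a PrometheusRule. A hedged sketch for the storage-capacity warning; the threshold and PVC name are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: media-storage-alerts
spec:
  groups:
    - name: media.storage
      rules:
        - alert: MediaVolumeAlmostFull
          expr: kubelet_volume_stats_available_bytes{persistentvolumeclaim="media-storage"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="media-storage"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Media storage has less than 10% free space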
Troubleshooting
Common issues and resolution steps:
- Streaming Issues
- Check network connectivity
- Verify transcoding settings
- Monitor resource usage
- Review logs
- Storage Problems
- Verify mount points
- Check permissions
- Monitor disk space
- Review I/O performance
- User Access Issues
- Verify authentication
- Check authorization
- Review user permissions
- Check invitation system
Observability
Installation Guide
Prerequisites
graph TD
  subgraph Hardware
    CP[Control Plane Nodes]
    GPU[GPU Worker Node]
    Worker[Worker Nodes]
  end
  subgraph Software
    OS[Operating System]
    Tools[Required Tools]
    Network[Network Setup]
  end
  subgraph Configuration
    Git[Git Repository]
    Secrets[SOPS Setup]
    Certs[Certificates]
  end
Hardware Requirements
Control Plane Nodes (3x)
- CPU: 4 cores per node
- RAM: 16GB per node
- Role: Cluster control plane
GPU Worker Node (1x)
- CPU: 16 cores
- RAM: 128GB
- GPU: 4x NVIDIA Tesla P100
- Role: GPU-accelerated workloads
Worker Nodes (2x)
- CPU: 16 cores per node
- RAM: 128GB per node
- Role: General workloads
Software Prerequisites
- Operating System
- Linux distribution
- Updated system packages
- Required kernel modules
- NVIDIA drivers (for GPU node)
- Required Tools
- kubectl
- flux
- SOPS
- age/gpg
- task
Initial Setup
1. Repository Setup
# Clone the repository
git clone https://github.com/username/dapper-cluster.git
cd dapper-cluster
# Create configuration
cp config.sample.yaml config.yaml
2. Configuration
graph LR
  Config[Configuration] --> Secrets[Secrets Management]
  Config --> Network[Network Settings]
  Config --> Storage[Storage Setup]
  Secrets --> SOPS[SOPS Encryption]
  Network --> DNS[DNS Setup]
  Storage --> CSI[CSI Drivers]
Edit Configuration
cluster:
  name: dapper-cluster
  domain: example.com
network:
  cidr: 10.0.0.0/16
storage:
  nfs:
    server: nfs.example.com
    path: /export/nfs
3. Secrets Management
- Generate age key
- Configure SOPS
- Encrypt sensitive files
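An illustrative sequence for these steps, assuming age is used as the SOPS backend; the key path, public key, and file path are placeholders:
# Generate an age key pair; the public key goes into .sops.yaml
age-keygen -o age.agekey
# Example .sops.yaml creation rule (public key is a shortened placeholder)
# creation_rules:
#   - path_regex: .*\.sops\.ya?ml
#     encrypted_regex: ^(data|stringData)$
#     age: age1examplepublickey...
# Encrypt a secret in place
sops --encrypt --in-place kubernetes/secrets/example.sops.yaml   # hypothetical path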
4. Bootstrap Process
graph TD
  Start[Start Installation] --> CP[Bootstrap Control Plane]
  CP --> Workers[Join Worker Nodes]
  Workers --> GPU[Configure GPU Node]
  GPU --> Flux[Install Flux]
  Flux --> Apps[Deploy Apps]
Bootstrap Commands
# Initialize flux
task flux:bootstrap
# Verify installation
task cluster:verify
# Verify GPU support
kubectl get nodes -o wide
nvidia-smi # on GPU node
Post-Installation
1. Verify Components
- Check control plane health
- Verify worker node status
- Test GPU functionality
- Check storage provisioners
- Verify network connectivity
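A few generic checks that cover the items above; namespace and node names may differ in this cluster:
# Control plane and node health
kubectl get nodes -o wide
kubectl get pods -n kube-system
# Storage classes and provisioners
kubectl get storageclass
kubectl get pods -n openebs-system   # provisioner namespace may differ
# Confirm the GPU node advertises GPU capacity
kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu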
2. Deploy Applications
- Deploy core services
- Configure monitoring
- Setup backup systems
- Deploy GPU-enabled workloads
3. Security Setup
- Configure network policies
- Setup certificate management
- Enable monitoring and alerts
- Secure GPU access
Troubleshooting
Common installation issues and solutions:
- Control Plane Issues
- Verify etcd cluster health
- Check control plane components
- Review system logs
- Worker Node Issues
- Verify node join process
- Check kubelet status
- Review node logs
- GPU Node Issues
- Verify NVIDIA driver installation
- Check NVIDIA container runtime
- Validate GPU visibility in cluster
- Storage Issues
- Verify NFS connectivity
- Check storage class configuration
- Review PV/PVC status
- Network Problems
- Check DNS resolution
- Verify network policies
- Review ingress configuration
Maintenance
Regular Tasks
- System updates
- Certificate renewal
- Backup verification
- Security audits
- GPU driver updates
Health Checks
- Component status
- Resource usage
- Storage capacity
- Network connectivity
- GPU health
Next Steps
After successful installation:
- Review Architecture Overview
- Configure Storage
- Setup Network
- Deploy Applications
Maintenance Guide
Maintenance Overview
graph TD
  subgraph Regular["Regular Tasks"]
    Updates[System Updates]
    Backups[Backup Tasks]
    Monitoring[Health Checks]
  end
  subgraph Periodic["Periodic Tasks"]
    Audit[Security Audits]
    Cleanup[Resource Cleanup]
    Review[Config Review]
  end
  Updates --> Verify[Verification]
  Backups --> Test[Backup Testing]
  Monitoring --> Alert[Alert Response]
Regular Maintenance
Daily Tasks
- Monitor system health
- Check cluster status
- Review resource usage
- Verify backup completion
- Check alert status
Weekly Tasks
- Review system logs
- Check storage usage
- Verify backup integrity
- Update documentation
Monthly Tasks
- Security updates
- Certificate rotation
- Resource optimization
- Performance review
Update Procedures
Flux Updates
graph LR
  PR[Pull Request] --> Review[Review Changes]
  Review --> Test[Test Environment]
  Test --> Deploy[Deploy to Prod]
  Deploy --> Monitor[Monitor Status]
Application Updates
- Review release notes
- Test in staging if available
- Update flux manifests
- Monitor deployment
- Verify functionality
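A hedged set of Flux commands for driving and watching such an update; the kustomization name is illustrative:
# Pull the latest commit and reconcile
flux reconcile source git flux-system
flux reconcile kustomization apps --with-source   # "apps" is an illustrative name
# Watch the rollout
flux get helmreleases -A
kubectl get pods -n <namespace> -w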
Backup Management
Backup Strategy
graph TD
  Apps[Applications] --> Data[Data Backup]
  Config[Configurations] --> Git[Git Repository]
  Secrets[Secrets] --> Vault[Secret Storage]
  Data --> Verify[Verification]
  Git --> Verify
  Vault --> Verify
Backup Verification
- Regular restore testing
- Data integrity checks
- Recovery time objectives
- Backup retention policy
Resource Management
Cleanup Procedures
- Remove unused resources
- Orphaned PVCs
- Completed jobs
- Old backups
- Unused configs
- Storage optimization
- Compress old logs
- Archive unused data
- Clean container cache
Monitoring and Alerts
Key Metrics
- Node health
- Pod status
- Resource usage
- Storage capacity
- Network performance
Alert Response
- Acknowledge alert
- Assess impact
- Investigate root cause
- Apply fix
- Document resolution
Security Maintenance
Regular Tasks
graph TD
  Audit[Security Audit] --> Review[Review Findings]
  Review --> Update[Update Policies]
  Update --> Test[Test Changes]
  Test --> Document[Document Changes]
Security Checklist
- Review network policies
- Check certificate expiration
- Audit access controls
- Review secret rotation
- Scan for vulnerabilities
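For the certificate checks, assuming cert-manager manages the certificates:
# Certificate status and renewal times (cert-manager)
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>
# Spot-check an exposed endpoint's expiry date
echo | openssl s_client -connect plex.example.com:443 2>/dev/null | openssl x509 -noout -enddate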
Troubleshooting Guide
Common Issues
- Node Problems
- Check node status
- Review system logs
- Verify resource usage
- Check connectivity
- Storage Issues
- Verify mount points
- Check permissions
- Monitor capacity
- Review I/O performance
- Network Problems
- Check DNS resolution
- Verify network policies
- Review ingress status
- Test connectivity
Recovery Procedures
- Node Recovery
# Check node status
kubectl get nodes
# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Perform maintenance
# ...
# Uncordon node
kubectl uncordon <node-name>
- Storage Recovery
# Check PV status
kubectl get pv
# Check PVC status
kubectl get pvc
# Verify storage class
kubectl get sc
Documentation
Maintenance Logs
- Keep detailed records
- Document changes
- Track issues
- Update procedures
Review Process
- Regular documentation review
- Update procedures
- Verify accuracy
- Add new sections
Best Practices
-
Change Management
- Use git workflow
- Test changes
- Document updates
- Monitor results
-
Resource Management
- Regular cleanup
- Optimize usage
- Monitor trends
- Plan capacity
-
Security
- Regular audits
- Update policies
- Monitor access
- Review logs
Troubleshooting Guide
Diagnostic Workflow
graph TD
  Issue[Issue Detected] --> Triage[Triage]
  Triage --> Diagnose[Diagnose]
  Diagnose --> Fix[Apply Fix]
  Fix --> Verify[Verify]
  Verify --> Document[Document]
Common Issues
1. Cluster Health Issues
Node Problems
graph TD
  Node[Node Issue] --> Check[Check Status]
  Check --> |Healthy| Resources[Resource Issue]
  Check --> |Unhealthy| System[System Issue]
  Resources --> Memory[Memory]
  Resources --> CPU[CPU]
  Resources --> Disk[Disk]
  System --> Logs[Check Logs]
  System --> Network[Network]
Diagnosis Steps:
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check system resources
kubectl top nodes
kubectl top pods --all-namespaces
# Check system logs
kubectl logs -n kube-system <pod-name>
2. Storage Issues
Volume Problems
graph LR
  PV[PV Issue] --> Status[Check Status]
  Status --> |Bound| Access[Access Issue]
  Status --> |Pending| Provision[Provisioning Issue]
  Status --> |Failed| Storage[Storage System]
Resolution Steps:
# Check PV/PVC status
kubectl get pv,pvc --all-namespaces
# Check storage class
kubectl get sc
# Check provisioner pods
kubectl get pods -n storage
3. Network Issues
Connectivity Problems
graph TD
  Net[Network Issue] --> DNS[DNS Check]
  Net --> Ingress[Ingress Check]
  Net --> Policy[Network Policy]
  DNS --> CoreDNS[CoreDNS Pods]
  Ingress --> Traefik[Traefik Logs]
  Policy --> Rules[Policy Rules]
Diagnostic Commands:
# Check DNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Check ingress
kubectl get ingress --all-namespaces
kubectl describe ingress <ingress-name> -n <namespace>
4. Application Issues
Pod Problems
graph TD
  Pod[Pod Issue] --> Status[Check Status]
  Status --> |Pending| Schedule[Scheduling]
  Status --> |CrashLoop| Crash[Container Crash]
  Status --> |Error| Logs[Check Logs]
Troubleshooting Steps:
# Check pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Check logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
Flux Issues
GitOps Troubleshooting
graph TD
  Flux[Flux Issue] --> Source[Source Controller]
  Flux --> Kust[Kustomize Controller]
  Flux --> Helm[Helm Controller]
  Source --> Git[Git Repository]
  Kust --> Sync[Sync Status]
  Helm --> Release[Release Status]
Resolution Steps:
# Check Flux components
flux check
# Check sources
flux get sources git
flux get sources helm
# Check reconciliation
flux get kustomizations
flux get helmreleases
Performance Issues
Resource Constraints
graph LR
  Perf[Performance] --> CPU[CPU Usage]
  Perf --> Memory[Memory Usage]
  Perf --> IO[I/O Usage]
  CPU --> Limit[Resource Limits]
  Memory --> Constraint[Memory Constraints]
  IO --> Bottleneck[I/O Bottleneck]
Analysis Commands:
# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
# Check resource quotas
kubectl get resourcequota -n <namespace>
Recovery Procedures
1. Node Recovery
- Drain node
- Perform maintenance
- Uncordon node
- Verify workloads
2. Storage Recovery
- Backup data
- Fix storage issues
- Restore data
- Verify access
3. Network Recovery
- Check connectivity
- Verify DNS
- Test ingress
- Update policies
Best Practices
1. Logging
- Maintain detailed logs
- Set appropriate retention
- Use structured logging
- Enable audit logging
2. Monitoring
- Set up alerts
- Monitor resources
- Track metrics
- Use dashboards
3. Documentation
- Document issues
- Record solutions
- Update procedures
- Share knowledge
Emergency Procedures
Critical Issues
- Assess impact
- Implement temporary fix
- Plan permanent solution
- Update documentation
Contact Information
- Maintain escalation paths
- Keep contact list updated
- Document response times
- Track incidents