Network Operations Runbook

Overview

This runbook provides step-by-step procedures for common network operations, troubleshooting, and emergency recovery scenarios for the Dapper Cluster network.

Quick Reference:

Network Topology Documentation
Brocade Core: 192.168.1.20
Arista Distribution: 192.168.1.21
Aruba Access: 192.168.1.26

Common Operations
Troubleshooting Procedures
Emergency Procedures
Switch Configuration
Performance Monitoring
Maintenance Windows

Common Operations

Accessing Network Equipment

SSH to Switches

Brocade ICX6610:

ssh admin@192.168.1.20
# [TODO: Document default credentials location]

Arista 7050:

ssh admin@192.168.1.21
# [TODO: Document default credentials location]

Aruba S2500-48p:

ssh admin@192.168.1.26
# [TODO: Document default credentials location]

Console Access

When SSH is unavailable:

# [TODO: Document console server or direct serial access]
# Brocade: Serial settings [TODO: baud rate, etc]
# Arista: Serial settings [TODO: baud rate, etc]

Checking Switch Health

Brocade ICX6610

# Basic health check
show version
show chassis
show cpu
show memory
show log tail 50

# Temperature and power
show inline power
show environment

# Check for errors
show logging | include error
show logging | include warn

Arista 7050

# Basic health check
show version
show environment all
show processes top

# Check for errors
show logging last 100
show logging | grep -i error

Verifying VLAN Configuration

Check VLAN Assignments

Brocade:

show vlan

# Check specific VLAN
show vlan 100
show vlan 150
show vlan 200

# Check which ports are in which VLANs
show vlan ethernet 1/1/1

Arista:

show vlan

# Check VLAN details
show vlan id 200

# Show interfaces by VLAN
show interfaces status

Verify Trunk Ports

Brocade:

# Show trunk configuration
show interface brief | include Trunk

# Show specific trunk
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Arista:

# Show trunk ports
show interface trunk

# Show specific interface
show interface ethernet 49
show interface ethernet 50

Checking Link Aggregation (LAG) Status

Brocade LAG Status

# Show all LAG groups
show lag brief

# Show specific LAG details
show lag [lag-id]

# Show which ports are in LAG
show lag | include active

# Check individual LAG port status
show interface ethernet 1/1/41
show interface ethernet 1/1/42

Expected State When Working (illustrative summary, not literal CLI output):

LAG "brocade-to-arista" (lag-id [X]) has 2 active ports:
  ethernet 1/1/41 (40Gb) - Active
  ethernet 1/1/42 (40Gb) - Active

Arista Port-Channel Status

# Show port-channel summary
show port-channel summary

# Show specific port-channel
show interface port-channel 1

# Check member interfaces
show interface ethernet 49 port-channel
show interface ethernet 50 port-channel

Expected State When Working (illustrative summary, not literal CLI output):

Port-Channel1:
  Active Ports: 2
  Et49: Active
  Et50: Active
  Protocol: LACP

Monitoring Traffic and Bandwidth

Real-Time Interface Statistics

Brocade:

# Show interface rates
show interface ethernet 1/1/41 | include rate
show interface ethernet 1/1/42 | include rate

# Show all interface statistics
show interface ethernet 1/1/41

# Monitor in real-time (if supported)
monitor interface ethernet 1/1/41

Arista:

# Show interface counters
show interface ethernet 49 counters

# Show interface rates
show interface ethernet 49 | grep rate

# Real-time monitoring
watch 1 show interface ethernet 49 counters rate

Identify Top Talkers

Brocade:

# [TODO: Document method to identify top talkers]
# May require SNMP monitoring or sFlow

Arista:

# Check interface utilization
show interface counters utilization

# If sFlow configured:
# [TODO: Document sFlow commands]

Testing Connectivity

From Your Workstation

Test Management Plane:

# Ping all management interfaces
ping -c 4 192.168.1.20  # Brocade
ping -c 4 192.168.1.21  # Arista
ping -c 4 192.168.1.26  # Aruba
ping -c 4 192.168.1.7   # Mikrotik House
ping -c 4 192.168.1.8   # Mikrotik Shop

# Test wireless bridge latency
ping -c 100 192.168.1.8 | tail -3
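
The management-plane pings above can be wrapped in one function that reports each device on its own line. This is a minimal sketch, assuming Linux iputils ping (where -W is a timeout in seconds); the function name check_mgmt_plane and the one-word labels are illustrative, and the addresses are the ones listed above.

```shell
# Ping each management address once and print OK/FAIL per device.
# Returns non-zero if at least one device did not answer.
check_mgmt_plane() {
  local failed=0
  while read -r ip label; do
    if ping -c 1 -W 2 "$ip" </dev/null >/dev/null 2>&1; then
      echo "OK   $ip $label"
    else
      echo "FAIL $ip $label"
      failed=1
    fi
  done <<EOF
192.168.1.20 Brocade
192.168.1.21 Arista
192.168.1.26 Aruba
192.168.1.7 Mikrotik-House
192.168.1.8 Mikrotik-Shop
EOF
  return "$failed"
}
```

Run check_mgmt_plane from the workstation; a FAIL on both garage switches with OK on the Aruba points at the wireless bridge.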

Test Server Network (VLAN 100):

# Test Kubernetes nodes
ping -c 4 10.100.0.40   # K8s VIP
ping -c 4 10.100.0.50   # talos-control-1
ping -c 4 10.100.0.51   # talos-control-2
ping -c 4 10.100.0.52   # talos-control-3

Test from Kubernetes Nodes:

# Talos nodes have no SSH or shell; run the tests from a pod via kubectl exec
kubectl exec -it -n default <pod-name> -- sh

# Test connectivity
ping 10.150.0.10   # Storage network
ping 10.100.0.1    # Gateway
ping 8.8.8.8       # Internet

MTU Testing (Jumbo Frames)

Test VLAN 150/200 MTU 9000:

# From a host on VLAN 150
ping -M do -s 8972 10.150.0.10

# -M do: Don't fragment
# -s 8972: 8972 + 28 (IP+ICMP headers) = 9000

# If this fails but smaller packets work, MTU is misconfigured
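
The payload arithmetic above generalizes to any MTU. The sketch below assumes IPv4 without IP options (20-byte IP header plus 8-byte ICMP header) and Linux iputils ping; the function names icmp_payload_for_mtu and test_mtu are illustrative.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}

# Send 4 don't-fragment pings sized for the given MTU.
# Arguments: target address, MTU under test.
test_mtu() {
  ping -M do -c 4 -s "$(icmp_payload_for_mtu "$2")" "$1"
}
```

For example, test_mtu 10.150.0.10 9000 sends the same 8972-byte probe as the command above, and test_mtu 10.150.0.10 1500 confirms the path works at the standard MTU.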

Path Testing

Trace route across networks:

# From your workstation
traceroute 10.100.0.50

# Expected path (if everything is working):
# 1. Local gateway
# 2. Wireless bridge
# 3. Brocade/OPNsense
# 4. Destination

Troubleshooting Procedures

Issue: No Connectivity to Garage Switches

Symptoms:

Cannot ping/SSH to Brocade (192.168.1.20) or Arista (192.168.1.21)
Can ping Aruba switch (192.168.1.26)

Diagnosis:

Test wireless bridge:
```
ping 192.168.1.7   # Mikrotik House
ping 192.168.1.8   # Mikrotik Shop
```
- If 192.168.1.7 responds but 192.168.1.8 doesn't: Wireless link down
- If neither responds: Mikrotik issue or config problem
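
The decision rules above can be written down as a small lookup. This is a sketch only: diagnose_bridge is an illustrative name, it takes 1 (answered) or 0 (no answer) for the house radio and then the shop radio, and the "both up" message is taken from the resolution steps for this issue.

```shell
# Map the two ping results to the diagnosis in this section.
# $1 = house radio (192.168.1.7), $2 = shop radio (192.168.1.8); 1 = answered, 0 = no answer.
diagnose_bridge() {
  case "$1$2" in
    11) echo "bridge up: check VLAN 1 on trunk ports and the Brocade port facing the Mikrotik" ;;
    10) echo "wireless link down" ;;
    00) echo "Mikrotik issue or config problem" ;;
    *)  echo "not covered by this runbook: house radio down but shop radio reachable" ;;
  esac
}

# Usage (Linux ping):
#   h=0; s=0
#   ping -c 2 -W 2 192.168.1.7 >/dev/null 2>&1 && h=1
#   ping -c 2 -W 2 192.168.1.8 >/dev/null 2>&1 && s=1
#   diagnose_bridge "$h" "$s"
```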

Check Aruba-to-Mikrotik connection:

# SSH to Aruba
ssh admin@192.168.1.26

# Check port status for Mikrotik connection
show interface [TODO: port ID]

Resolution:

If wireless bridge is down:

Check each Mikrotik radio's web interface (192.168.1.7, 192.168.1.8)
Check alignment and signal strength
Verify power to both radios
Check for interference (weather, obstacles)
Emergency: Use physical console access to switches in garage

If Mikrotik is up but switches unreachable:

Check VLAN 1 configuration on trunk ports
Verify Mikrotik is not blocking traffic
Check Brocade port connected to Mikrotik is up

Issue: Kubernetes Pods Can't Access Storage

Symptoms:

Pods stuck in ContainerCreating
PVC stuck in Pending
Errors about unable to mount CephFS

Diagnosis:

Check Rook/Ceph health:

kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Check network connectivity from Kubernetes nodes to Ceph monitors:

# From a debug pod (Talos nodes have no shell)
ping 10.150.0.10   # Test VLAN 150 connectivity

# Test Ceph monitor port
nc -zv <monitor-ip> 6789

Verify VLAN 150 MTU:

# Test jumbo frames
ping -M do -s 8972 10.150.0.10

Check CSI driver logs:

kubectl -n rook-ceph logs -l app=csi-cephfsplugin --tail=100

Resolution:

If MTU mismatch:

Verify MTU 9000 on all VLAN 150 interfaces
Check Proxmox bridge MTU settings
Check switch port MTU configuration

If connectivity issue:

Check VLAN 150 is properly tagged on trunk ports
Verify Proxmox host network configuration
Check Brocade routing for VLAN 150

Issue: Slow Ceph Performance

Symptoms:

Slow pod startup times
High I/O latency in applications
Ceph health warnings about slow ops

Diagnosis:

Check Ceph cluster health:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

Check network bandwidth utilization:

On Brocade (VLAN 150 - Ceph Public):

# Check 10Gb bonds to Proxmox hosts
show interface ethernet 1/1/[TODO: ports] | include rate

On Arista (VLAN 200 - Ceph Cluster):

# Check 40Gb links to Proxmox hosts
show interface ethernet [TODO: ports] counters rate

Identify bottlenecks:
- Are 10Gb links saturated? (VLAN 150)
- Are 40Gb links saturated? (VLAN 200)
- Is the Brocade-Arista link saturated?

Resolution:

If Brocade-Arista link is bottleneck:

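Primary issue: only one 40Gb link is active (see "Configure Brocade-Arista LAG" under Switch Configuration to enable the second)
Enabling the second 40Gb link doubles aggregate bandwidth to 80 Gbps; a single flow still tops out at 40 Gbps because the LAG hashes each flow onto one member link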

If MTU not configured:

Verify MTU 9000 on VLAN 150 and 200
Check each hop in the path

If switch CPU is high:

Check for broadcast storms
Verify STP is working correctly
Look for loops in topology

Issue: Network Loop / Broadcast Storm

Symptoms:

Network performance severely degraded
High CPU usage on switches
Connectivity flapping
Massive packet rates on interfaces

Diagnosis:

Check for duplicate MAC addresses:

# Brocade
show mac-address

# Look for same MAC on multiple ports

Check STP status:

# Brocade
show spanning-tree

# Arista
show spanning-tree

Look for physical loops:
- Review physical topology diagram
- Check for accidental double connections
- Known issue: Brocade-Arista 2x 40Gb links not in LAG

Resolution:

Immediate (Emergency):

Disable one link causing loop:

# On Arista (already done in current config)
configure
interface ethernet 50
shutdown

Verify spanning-tree is enabled:

# Brocade
show spanning-tree

# If not enabled:
configure terminal
spanning-tree

Permanent Fix:

Configure proper LAG/port-channel (see "Configure Brocade-Arista LAG" under Switch Configuration)

Issue: Proxmox Host Loses Network Connectivity

Symptoms:

Cannot ping Proxmox host management IP
VMs on host also offline
IPMI still accessible

Diagnosis:

Access via IPMI console:
```
# [TODO: Document IPMI access method]
```

Check bond status on Proxmox:

# From Proxmox console
ip link show

# Check bond interfaces
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1

Check switch ports:

# On Brocade
show interface ethernet 1/1/[TODO: ports for this host]
show lag [TODO: lag-id for this host]

Resolution:

If bond is down on Proxmox:

Check physical cables
Restart networking on Proxmox (WARNING: will disrupt VMs)
Check switch port status

If ports down on switch:

Check for error counters
Re-enable port if administratively down
Check for physical issues (SFP, cable)

Issue: High Latency Across Wireless Bridge

Symptoms:

Ping times to garage > 10ms (normally 1-2ms)
Slow access to services in garage
Packet loss

Diagnosis:

Test latency:

ping -c 100 192.168.1.8

# Look at:
# - Average latency
# - Packet loss %
# - Jitter (variation)
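
The three figures listed above can be pulled out of the ping summary automatically. The sketch below assumes the summary format printed by Linux iputils ping (a "packet loss" line and an "rtt min/avg/max/mdev" line) and treats mdev as a rough stand-in for jitter; summarize_ping is an illustrative name.

```shell
# Read ping output on stdin and print loss, average latency, and mdev.
summarize_ping() {
  awk '
    /packet loss/ {
      # The loss figure is the only field ending in a percent sign.
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
    }
    /^rtt|^round-trip/ {
      # Line looks like: rtt min/avg/max/mdev = 1.0/1.5/3.2/0.4 ms
      split($0, halves, " = ")
      split(halves[2], v, "/")
      avg = v[2]
      sub(/ ms$/, "", v[4])
      mdev = v[4]
    }
    END { printf "loss=%s avg_ms=%s mdev_ms=%s\n", loss, avg, mdev }
  '
}

# Usage: ping -c 100 192.168.1.8 | summarize_ping
```

Compare the result with the thresholds under Baseline Performance Metrics (latency > 5ms or loss > 1% is an alert).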

Check Mikrotik radio status:
- Access web interface: 192.168.1.7 and 192.168.1.8
- Check signal strength
- Check throughput/bandwidth utilization
- Look for interference

Test with iperf:

# On server side (garage)
iperf3 -s

# On client side (house): target the garage host running iperf3 -s,
# not the Mikrotik radio address
iperf3 -c <garage-host-ip> -t 30

# Should see ~1 Gbps

Resolution:

If signal degraded:

Check for obstructions (trees, weather)
Check alignment
Check for interference sources
Consider backup link or failover

If bandwidth saturated:

Identify high-bandwidth users/applications
Implement QoS if available
Consider upgrade to higher bandwidth link

Emergency Procedures

Complete Network Outage (Wireless Bridge Down)

Impact:

No remote access to garage infrastructure
Kubernetes cluster still functions internally
No internet access from garage
Management access requires physical presence

Emergency Access Methods:

Physical console access:

# [TODO: Document where console cables are stored]
# Connect laptop directly to switch console port

IPMI access (if VPN or alternative route exists):
```
# [TODO: Document IPMI network topology]
```

Restoration Steps:

Check Mikrotik radios:
- Physical inspection of both radios
- Power cycle if needed
- Check alignment
Temporary workaround:
- [TODO: Document backup connectivity method]
- VPN tunnel over alternative route?
- Temporary cable run?

Verify restoration:

ping 192.168.1.8
ping 192.168.1.20
ssh admin@192.168.1.20

Core Switch (Brocade) Failure

Impact:

Loss of VLAN 150/200 routing
Kubernetes cluster degraded (storage issues)
Loss of 10Gb connectivity to Proxmox hosts

Emergency Actions:

Do NOT reboot all Proxmox hosts simultaneously
- The cluster may still be serving workloads that are already running
- Storage connections via VLAN 200 through Arista may still work
Check Brocade status:
- Physical inspection (power, fans, LEDs)
- Console access
- Review logs
If Brocade must be replaced:
- [TODO: Document backup configuration location]
- [TODO: Document restoration procedure]
- [TODO: Document spare hardware location]

Spanning Tree Failure / Network Loop

Impact:

Network completely unusable
High CPU on all switches
Broadcast storm

Emergency Actions:

Disconnect Brocade-Arista links:

# On Arista (fastest access if SSH still works)
configure
interface ethernet 49
shutdown
interface ethernet 50
shutdown

Or physically disconnect:
- Unplug both 40Gb QSFP+ cables between Brocade and Arista
Wait for network to stabilize (30-60 seconds)

Reconnect ONE link only:

# On Arista
configure
interface ethernet 49
no shutdown

Verify stability before enabling second link

Accidental Configuration Change

Symptoms:

Network suddenly degraded after change
New errors appearing
Connectivity loss

Emergency Actions:

Rollback configuration:

Brocade:

# Show the saved startup configuration (compare against show running-config)
show configuration

# Revert to previous config
# [TODO: Document Brocade config rollback method]

Arista:

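# Show pending configuration sessions
show configuration sessions

# Show differences between running-config and startup-config
show running-config diffs

# Revert running-config to the saved startup-config without a reboot
configure replace startup-config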

If rollback not available:
- Reboot switch (loads startup-config); this only undoes changes that were not saved with write memory
- WARNING: Brief outage during reboot

Switch Configuration

Configure Brocade-Arista LAG (Fix Loop Issue)

Prerequisites:

Maintenance window scheduled
Both 40Gb QSFP+ cables connected and working
Console access to both switches available
Configuration backed up

Step 1: Pre-Change Verification

# Verify current state
# On Brocade:
show interface ethernet 1/1/41
show interface ethernet 1/1/42

# On Arista:
show interface ethernet 49
show interface ethernet 50  # Currently disabled

# Document current traffic levels
show interface ethernet 1/1/41 | include rate

Step 2: Configure Brocade LAG

# SSH to Brocade
ssh admin@192.168.1.20

# Enter configuration mode
enable
configure terminal

# Create LAG
lag brocade-to-arista dynamic id [TODO: Choose available LAG ID, e.g., 10]
  ports ethernet 1/1/41 to 1/1/42
  primary-port 1/1/41
  lacp-timeout short
  deploy
exit

# Configure VLAN on LAG
vlan 1
  tagged lag [LAG-ID]
exit

vlan 100
  tagged lag [LAG-ID]
exit

vlan 150
  tagged lag [LAG-ID]
exit

vlan 200
  tagged lag [LAG-ID]
exit

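# Apply to interfaces (only on older firmware without the lag command;
# where "lag ... deploy" is used, LACP is already active on the member ports)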
interface ethernet 1/1/41
  link-aggregate active
exit

interface ethernet 1/1/42
  link-aggregate active
exit

# Save configuration
write memory

# Verify
show lag brief
show lag [LAG-ID]

Step 3: Configure Arista Port-Channel

# SSH to Arista
ssh admin@192.168.1.21

# Enter configuration mode
enable
configure

# Create port-channel
interface Port-Channel1
  description Link to Brocade ICX6610
  switchport mode trunk
  switchport trunk allowed vlan 1,100,150,200
exit

# Add member interfaces
interface Ethernet49
  description Brocade 40G Link 1
  channel-group 1 mode active
  lacp rate fast
exit

interface Ethernet50
  description Brocade 40G Link 2
  channel-group 1 mode active
  lacp rate fast
exit

# Save configuration
write memory

# Verify
show port-channel summary
show interface Port-Channel1
show lacp neighbor

Step 4: Verify Configuration

# On Brocade:
show lag [LAG-ID]
# Should show: 2 ports active

show lacp
# Should show: Negotiated with neighbor

# On Arista:
show port-channel summary
# Should show: Po1(U) with Et49(P), Et50(P)

show lacp neighbor
# Should show: Brocade as partner

# Test traffic balancing
show interface Port-Channel1 counters
show interface ethernet 49 counters
show interface ethernet 50 counters
# Both Et49 and Et50 should show traffic
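
To judge how traffic is split, the two member counters can be turned into percentages. This is a small sketch with an illustrative function name; feed it the same counter (for example output octets) read from Et49 and Et50. An uneven split is normal with few flows, because the LAG hashes each flow onto a single member.

```shell
# Percentage of total traffic carried by each of two LAG members.
# $1 and $2 = the same counter read from each member interface.
lag_member_share() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    total = a + b
    if (total == 0) { print "no traffic"; exit }
    printf "%.1f%% %.1f%%\n", a / total * 100, b / total * 100
  }'
}

# Usage: lag_member_share <Et49 octets> <Et50 octets>
```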

Step 5: Monitor for Issues

# Watch for 15 minutes
# On Arista:
watch 10 show port-channel summary

# Check for errors
show logging | grep -i Port-Channel1

# Monitor CPU (should be normal)
show processes top

Rollback Plan (if issues occur):

# On Arista (fastest to disable)
configure
interface ethernet 50
shutdown

# On Brocade (if needed)
configure terminal
no lag brocade-to-arista
interface ethernet 1/1/41
  no link-aggregate
interface ethernet 1/1/42
  no link-aggregate

Adding a New VLAN

Example: Adding VLAN 300 for IoT devices

Step 1: Plan VLAN

VLAN ID: 300
Network: [TODO: e.g., 10.30.0.0/24]
Gateway: [TODO: Which device?]
Required on: [TODO: Which switches/trunks?]

Step 2: Create VLAN on Brocade

ssh admin@192.168.1.20
enable
configure terminal

# Create VLAN
vlan 300
  name IoT-Network
  tagged ethernet 1/1/41 to 1/1/42  # Trunk to Arista
  tagged ethernet 1/1/[TODO]         # Trunk to Mikrotik
  untagged ethernet 1/1/[TODO]       # Access ports if needed
exit

# If Brocade is gateway:
interface ve 300
  ip address [TODO: IP]/24
exit

# Save
write memory

Step 3: Add to other switches as needed

# On Arista:
configure
vlan 300
  name IoT-Network
exit

interface Port-Channel1
  switchport trunk allowed vlan add 300
exit

write memory

Configuring Jumbo Frames (MTU 9000)

For VLAN 150 and 200 (Ceph networks)

On Brocade:

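# Enable jumbo frame support globally
# (on FastIron this is a global setting and needs a reload to take effect;
# verify against your FastIron release before scheduling the reload)
configure terminal
jumbo
write memory
end
reload

# After the reload, raise the IP MTU on the routed VLAN interfaces
configure terminal
interface ve 150
  ip mtu 9000
exit

interface ve 200
  ip mtu 9000
exit

# Physical and LAG ports: [TODO: confirm storage network ports pass jumbo frames]

write memory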

On Arista:

# Configure MTU on routed interfaces carrying VLAN 150/200
# (EOS accepts at most 9214; switched L2 ports typically pass jumbo frames by default)
interface ethernet [TODO: ports]
  mtu 9214
exit

write memory

Verify MTU:

# From a host or pod with an interface on VLAN 150 (Talos nodes have no shell)
ping -M do -s 8972 10.150.0.10
# Should succeed without fragmentation

Performance Monitoring

Key Metrics to Monitor

Switch Health:

CPU utilization (should be <30% normally)
Memory utilization (should be <70%)
Temperature (within operating range)
Power supply status

Interface Health:

Error counters (input/output errors)
CRC errors
Interface resets
Utilization percentage

Traffic Patterns:

Bandwidth utilization per interface
Top talkers per VLAN
Broadcast/multicast rates

Setting Up Monitoring

[TODO: Document monitoring setup]

Options:

SNMP monitoring to Prometheus
sFlow for traffic analysis
Switch logging to Loki
Grafana dashboards

Example Prometheus Targets:

# [TODO: Example prometheus config for SNMP exporter]

Baseline Performance Metrics

Normal Operating Conditions:

| Metric | Expected Value | Alert Threshold |
|---|---|---|
| Wireless Bridge Latency | 1-2ms | > 5ms |
| Wireless Bridge Loss | 0% | > 1% |
| Brocade CPU | < 20% | > 60% |
| Arista CPU | < 15% | > 50% |
| 40Gb Link Utilization | < 50% | > 80% |
| 10Gb Link Utilization | < 60% | > 85% |

[TODO: Add baseline measurements]
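
The link utilization rows can be checked against a measured rate with a little arithmetic. The sketch below is illustrative: the function names are assumptions, the rate is expected in bits per second as read from the switch, and the limits are the 80% (40Gb) and 85% (10Gb) alert thresholds from the table above.

```shell
# Utilization of a link as a percentage of its nominal speed.
# $1 = measured rate in bits per second, $2 = link speed in Gb/s.
link_utilization_pct() {
  awk -v bps="$1" -v gbps="$2" 'BEGIN { printf "%.1f\n", bps / (gbps * 1e9) * 100 }'
}

# Print ALERT or OK for a utilization percentage on a 40Gb or 10Gb link.
# $1 = link speed in Gb/s (40 or 10), $2 = utilization in percent.
utilization_status() {
  local gbps="$1" pct="$2" limit
  case "$gbps" in
    40) limit=80 ;;
    10) limit=85 ;;
    *)  echo "unknown link speed: $gbps" >&2; return 2 ;;
  esac
  awk -v p="$pct" -v l="$limit" 'BEGIN { s = (p > l) ? "ALERT" : "OK"; print s }'
}

# Usage: utilization_status 40 "$(link_utilization_pct 20000000000 40)"
```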

Maintenance Windows

Pre-Maintenance Checklist

Before any network maintenance:

Schedule maintenance window
Notify all users

Back up switch configurations

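# Run from your workstation: the switch CLI cannot expand $(date) or write to local files.
# Assumes the login is allowed to run show running-config over SSH.
# Brocade
ssh admin@192.168.1.20 "show running-config" > brocade-backup-$(date +%Y%m%d).cfg

# Arista
ssh admin@192.168.1.21 "show running-config" > arista-backup-$(date +%Y%m%d).cfg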

Document current state
Have rollback plan ready
Ensure console access available
Test backup connectivity method

Post-Maintenance Checklist

After any network maintenance:

Verify all links are up

show interface brief
show lag brief  # Brocade
show port-channel summary  # Arista

Check for errors
```
show logging | include error
```
Test connectivity to all VLANs
Monitor for 30 minutes for issues
Update documentation with any changes
Save configurations
```
write memory
```

Regular Maintenance Tasks

Weekly:

Review switch logs for errors/warnings
Check interface error counters
Verify wireless bridge performance

Monthly:

Review bandwidth utilization trends
Check for firmware updates
Verify backup configurations are current

Quarterly:

Review and update network documentation
Test emergency procedures
Review and optimize switch configurations

Configuration Backup

Backing Up Switch Configurations

Brocade ICX6610:

# Method 1: Copy to TFTP server (the switch CLI does not expand $(date); type the date)
copy running-config tftp [TODO: TFTP server IP] brocade-backup-YYYYMMDD.cfg

# Method 2: Capture over SSH from your workstation
ssh admin@192.168.1.20 "show running-config" > /tmp/brocade-config.txt

# [TODO: Document automated backup method]

Arista 7050:

# Show running config
show running-config

# Copy to USB (if available); the EOS CLI does not expand $(date), so type the date
copy running-config usb:/arista-backup-YYYYMMDD.cfg

# [TODO: Document automated backup method]

Storage Location:

[TODO: Document where configurations are backed up]
Consider: Git repository for version control
Consider: Automated daily backups via Ansible
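
Until an automated method is chosen, a dated backup can be scripted from the workstation. This is a minimal sketch under stated assumptions: the switch accepts a command over SSH, the admin login may run show running-config, and key-based authentication is set up. The function name backup_switch and the file naming are illustrative.

```shell
# Save a switch's running-config to <dir>/<name>-backup-YYYYMMDD.cfg.
# $1 = short name for the file, $2 = management IP, $3 = target directory (default: current).
# Prints the path on success; removes the partial file and returns 1 on failure.
backup_switch() {
  local name="$1" host="$2" dir="${3:-.}"
  local out
  out="${dir}/${name}-backup-$(date +%Y%m%d).cfg"
  if ssh "admin@${host}" "show running-config" </dev/null > "$out"; then
    echo "$out"
  else
    rm -f "$out"
    return 1
  fi
}

# Usage:
#   backup_switch brocade 192.168.1.20 ~/switch-backups
#   backup_switch arista  192.168.1.21 ~/switch-backups
```

Pointing the target directory at a Git working tree gives the version history suggested above.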

Restoring Configurations

Brocade:

# Load config from file (merges into the running config; it does not replace it)
copy tftp running-config [TFTP-IP] [filename]

# Or manually paste config
configure terminal
# Paste configuration

Arista:

# Copy config from file (merges into the running config; it does not replace it)
copy usb:/backup.cfg running-config

# Or configure manually
configure
# Paste configuration

Security Considerations

Access Control

[TODO: Document security policies]

Who has access to switch management?
How are credentials managed?
Is 2FA available/configured?
Are management VLANs isolated?

Security Best Practices

Change default passwords
Disable unused ports
Enable port security where appropriate
Configure DHCP snooping
Enable storm control
Regular firmware updates
Monitor for unauthorized devices

Useful Commands Reference

Brocade ICX6610 Quick Reference

# Basic show commands
show version
show running-config
show interface brief
show vlan
show lag
show mac-address
show spanning-tree
show log

# Interface management
interface ethernet 1/1/1
  enable
  disable
  description [text]

# Save configuration
write memory

Arista 7050 Quick Reference

# Basic show commands
show version
show running-config
show interfaces status
show vlan
show port-channel summary
show mac address-table
show spanning-tree
show logging

# Interface management
configure
interface ethernet 1
  shutdown
  no shutdown
  description [text]

# Save configuration
write memory

Contacts and Escalation

[TODO: Fill in contact information]

| Role | Name | Contact | Escalation Level |
|---|---|---|---|
| Primary Network Admin | [TODO] | [TODO] | 1 |
| Secondary Contact | [TODO] | [TODO] | 2 |
| Vendor Support - Brocade | [TODO] | [TODO] | 3 |
| Vendor Support - Arista | [TODO] | [TODO] | 3 |

Change Log

| Date | Change | Person | Impact | Notes |
|---|---|---|---|---|
| 2025-10-14 | Initial runbook created | Claude | None | Baseline documentation |
| [TODO] | [TODO] | [TODO] | [TODO] | [TODO] |

References

Network Topology Documentation
Storage Architecture - For Ceph network details
Brocade ICX6610 Documentation: [TODO: Link]
Arista 7050 Documentation: [TODO: Link]
[TODO: Add other relevant documentation links]