Fixing Dell PowerEdge R620 Freezes: PSU Hot Spare, Power Cap Policy, and Power Budget Analysis

Legacy servers still serve well for lab clusters, but mixing older hardware with modern workloads often surfaces subtle hardware–software interaction issues. On a Dell PowerEdge R620 running Ubuntu 20.04, I observed a repeatable NIC failure: roughly every 20 minutes the primary interface dropped connectivity, and it only came back after a forced event such as pressing the power button.

This instability appeared after I temporarily installed an NVIDIA Tesla T4 GPU for ML workloads and later removed it.

Environment Context

Server: Dell PowerEdge R620
OS: Ubuntu 20.04 LTS (kernel 5.4, later tested with 5.15 HWE)
NICs: Quad-port Intel 1GbE adapters, driven by igb driver 5.6.0-k
Use case: Serving internet-facing APIs and running databases
Event timeline:
1. Server stable before GPU installation.
2. Installed NVIDIA T4 (PCIe Gen3 GPU).
3. Removed T4 after testing.
4. NICs began link flapping every ~20 minutes.

Observed Behavior

Network outage for a few seconds every ~20 minutes.

Logs showed repeated “link up” renegotiations:

igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Pressing the chassis power button restored connectivity immediately, presumably because the resulting ACPI interrupt woke the stalled device.
Kernel logs contained UFW multicast blocks (unrelated noise), and SATA “link down” events for empty ports (benign).
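To confirm the roughly 20-minute cadence rather than eyeballing it, the gaps between link-up messages can be computed from the kernel log. The following is a minimal sketch, not the method used during the original investigation: link_up_intervals is a hypothetical helper name, and it assumes the default dmesg output format with bracketed seconds-since-boot timestamps in front of each line.

```shell
#!/usr/bin/env bash
# link_up_intervals IFACE   (hypothetical helper)
# Reads dmesg-style lines on stdin and prints the whole seconds elapsed
# between consecutive "NIC Link is Up" messages that mention IFACE.
link_up_intervals() {
  awk -F'[][]' -v iface="$1" '
    index($0, iface) && /NIC Link is Up/ {
      t = $2 + 0                        # "[ 1234.567890]" -> 1234.56789
      if (seen) printf "%d\n", t - prev # gap since the previous event
      prev = t; seen = 1
    }'
}

# Typical use on the affected host:
#   sudo dmesg | link_up_intervals eno1
```

Intervals clustering around 1200 seconds would match the behavior described above; irregular gaps would point toward a different trigger.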

Step 1: Understanding the Server Power Budget

Typical per-component power draw on an R620-class system:

Component	Typical Draw (R620 class)	Notes
CPUs (2x Xeon E5-26xx v2)	95–130W each	Peak 260W combined
Memory (24x DDR3 ECC DIMMs)	3–5W idle, 8–12W loaded → up to 200W	Fully loaded chassis burns a lot of watts in RAM
Fans (6 high-speed blowers)	10–20W each → 60–120W	Draw increases if GPU or ambient temp is high
NICs (onboard Intel 1GbE)	5–10W total	Small average load, but burst-sensitive
RAID/HBA (PERC H710)	12–15W	Constant baseline
GPU (Tesla T4)	~70W steady, higher bursts	Briefly installed, altered PCIe topology
Misc (drives, backplane)	20–40W	Adds to baseline

Who dominates?

Steady state: CPUs and memory are the biggest consumers.
Burst load: PCIe devices (NICs, GPUs, HBAs) can cause sudden spikes, even if their average power is low.
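As a worked example of the budget, the sketch below adds up the upper-end figure from each row of the table and compares the total with a single 750W PSU and with a 450W cap (the example values used in Steps 2 and 3). The helper names sum_watts and headroom are hypothetical, and the inputs are the table's estimates, not measurements.

```shell
#!/usr/bin/env bash
# sum_watts W1 W2 ...: print the total of all arguments (integer watts).
sum_watts() {
  local total=0 w
  for w in "$@"; do
    total=$(( total + w ))
  done
  echo "$total"
}

# headroom LIMIT W1 W2 ...: print LIMIT minus the total draw
# (a negative result means the budget is exceeded).
headroom() {
  local limit=$1; shift
  echo $(( limit - $(sum_watts "$@") ))
}

# Upper-end estimates from the table: CPUs, memory, fans, NICs, PERC, T4, misc
peaks=(260 200 120 10 15 70 40)
echo "peak total: $(sum_watts "${peaks[@]}") W"
echo "headroom on one 750 W PSU: $(headroom 750 "${peaks[@]}") W"
echo "headroom under a 450 W cap: $(headroom 450 "${peaks[@]}") W"
```

With these estimates the worst case comes to 715W, which leaves only 35W of margin on one 750W supply and overshoots a 450W cap by 265W.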

Step 2: PSU Hot Spare Behavior

Dell servers with dual PSUs allow two policies:

Hot Spare = Enabled
- One PSU is active, the other idle.
- Server runs at half of total PSU capacity (e.g., 750W of 1500W).
- Any transient spike (NIC burst, PCIe lane wakeup, GPU ramp) must be absorbed by that single PSU.
Hot Spare = Disabled (Redundant Mode)
- Both PSUs are active and load-share.
- Still redundant: if one fails, the other instantly ramps up.
- System has full headroom for bursts, reducing risk of power droop.
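To make the difference between the two policies concrete, here is a small arithmetic sketch. It assumes two 750W supplies, an even split when both are active, and a hypothetical 600W load; psu_load_pct is an illustrative helper name.

```shell
#!/usr/bin/env bash
# psu_load_pct LOAD_W PSU_RATING_W ACTIVE_PSUS
# Prints the load on each active PSU as a whole percentage of its rating,
# assuming the load is shared evenly across the active supplies.
psu_load_pct() {
  local load=$1 rating=$2 active=$3
  echo $(( 100 * load / (rating * active) ))
}

# Hypothetical 600 W system load on 750 W supplies
echo "hot spare enabled  (1 active): $(psu_load_pct 600 750 1)%"
echo "hot spare disabled (2 active): $(psu_load_pct 600 750 2)%"
```

Under these assumptions the single active supply sits at 80% of its rating, while two load-sharing supplies each sit at 40%, which is where the extra burst headroom comes from.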

Step 3: Power Cap Policy Interaction

Dell iDRAC/BIOS allows enforcing a system-wide power cap (e.g., 450W).

With Hot Spare enabled, that cap applies against a single PSU.
This artificially limits headroom even further.
Result: burst demand from PCIe or NICs → cap enforced → droop → system freeze.
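One way to see how close the system runs to a cap is to read the BMC's power figure and compare it with the configured limit. The sketch below is an assumption-laden illustration: it presumes ipmitool is installed and that the iDRAC answers DCMI power readings, and the helper names instantaneous_watts and cap_headroom_pct are hypothetical.

```shell
#!/usr/bin/env bash
# instantaneous_watts: read `ipmitool dcmi power reading` output on stdin
# and print the instantaneous draw in watts.
instantaneous_watts() {
  awk '/Instantaneous power reading/ { print $(NF-1); exit }'
}

# cap_headroom_pct CAP_W DRAW_W: print the remaining headroom as a whole
# percentage of the cap (negative when the draw exceeds the cap).
cap_headroom_pct() {
  echo $(( 100 * ($1 - $2) / $1 ))
}

# Typical use on the host (needs ipmitool and the IPMI kernel modules):
#   draw=$(sudo ipmitool dcmi power reading | instantaneous_watts)
#   echo "headroom under a 450 W cap: $(cap_headroom_pct 450 "$draw")%"
```

The reading is an averaged value reported by the BMC, so it will not show millisecond bursts; it only indicates how little margin those bursts have to work with.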

Step 4: Why NICs Showed the Problem First

Even though CPUs and RAM consume more overall, NICs and PCIe cards are burst-sensitive:

When a NIC flushes TX/RX queues under high traffic, it can spike power in milliseconds.
With only one PSU active (Hot Spare), those bursts starved PCIe lanes.
The NIC link reset (NIC Link is Up) was the symptom of transient droop.
As bursts increased, the entire PCIe bus destabilized → OS freeze.

Root Cause Analysis

igb Driver Age
- Ubuntu 20.04 ships with igb 5.6.0-k, dating back to ~2014.
- Known issue: the driver mishandles low-power states, especially with ASPM and EEE enabled.
Impact of GPU Install/Remove
- Installing/removing a PCIe Gen3 GPU (NVIDIA T4) modified PCIe lane routing and ASPM configuration.
- After removal, residual ASPM/ACPI state pushed the NICs into unstable power modes.
- Legacy R620 firmware (2012–2013) was not designed with these devices in mind.
Power Management Features
- Energy Efficient Ethernet (EEE): enabled by default, frequently causes renegotiation on Intel igb NICs.
- PCIe ASPM: interacts poorly with igb after PCIe topology changes, causing link dropouts.
Firmware/BIOS Age
- Without Dell BIOS + NIC firmware updates, igb stability issues are amplified.
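Before changing anything, it helps to record the current EEE and ASPM state so the effect of each fix can be verified afterwards. The helpers below are a minimal sketch with hypothetical names (eee_status, aspm_policy); they only parse the output of ethtool --show-eee and the kernel's pcie_aspm policy file.

```shell
#!/usr/bin/env bash
# eee_status: read `ethtool --show-eee IFACE` output on stdin and print the
# text after "EEE status:" (for example "enabled - active" or "disabled").
eee_status() {
  sed -n 's/^[[:space:]]*EEE status:[[:space:]]*//p'
}

# aspm_policy: read the pcie_aspm policy line on stdin and print the
# bracketed entry, which is the policy currently in effect.
aspm_policy() {
  sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Typical use on the host:
#   ethtool --show-eee eno1 | eee_status
#   aspm_policy < /sys/module/pcie_aspm/parameters/policy
```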

Monitoring and Verification

To confirm the failure mode, I ran continuous checks:

# Log monitoring
sudo dmesg --follow | grep eno1

# Link state monitoring
watch -n 5 cat /sys/class/net/eno1/carrier

During drops, carrier flipped 1 → 0 → 1.
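A 5-second watch interval can miss a drop that lasts only a few seconds. As an alternative, the sketch below samples the carrier file more often and prints a timestamped line only when the value changes; carrier_transitions is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# carrier_transitions: read one carrier sample per line on stdin
# ("1" = link up, "0" = link down) and print a timestamped line for
# every change between consecutive samples.
carrier_transitions() {
  local prev="" cur
  while read -r cur; do
    if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
      echo "$(date '+%F %T') carrier $prev -> $cur"
    fi
    prev=$cur
  done
}

# Typical use on the host, sampling once per second:
#   while sleep 1; do cat /sys/class/net/eno1/carrier; done | carrier_transitions
```

Redirecting the output to a file gives a compact record of each flap that can be lined up against the kernel log.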

Driver information:

ethtool -i eno1

confirmed igb 5.6.0-k.

Resolution Steps

1. PSU Settings (BIOS/iDRAC)

Disable Hot Spare.
Keep Redundancy = Redundant.
Disable System Power Cap Policy.
Set Performance Mode for PCIe power.
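If racadm is available, the same settings can be read from the command line before and after the change. This is a read-only sketch under a loud assumption: the attribute names below are taken from the System.Power group as I understand it on iDRAC7/8 and may differ on other firmware, so check them against your iDRAC's RACADM reference first. show_power_policy is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# show_power_policy: print the iDRAC power settings discussed above.
# ASSUMPTION: these attribute names come from the iDRAC7/8 System.Power
# group; confirm them on your firmware before relying on this.
show_power_policy() {
  local attr
  for attr in \
    System.Power.Hotspare.Enable \
    System.Power.RedundancyPolicy \
    System.Power.Cap.Enable; do
    echo "== $attr"
    racadm get "$attr"
  done
}

# Typical use on a host with racadm installed:
#   show_power_policy
```

The function only issues get requests, so it is safe to run while the server is in service.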

2. Disable Energy Efficient Ethernet (EEE)

sudo ethtool --set-eee eno1 eee off
sudo ethtool --set-eee eno2 eee off
sudo ethtool --set-eee eno3 eee off
sudo ethtool --set-eee eno4 eee off

Persistent configuration via /etc/networkd-dispatcher/routable.d/disable-eee.sh (the script must be executable, or networkd-dispatcher will skip it):

#!/bin/bash
ethtool --set-eee eno1 eee off
ethtool --set-eee eno2 eee off
ethtool --set-eee eno3 eee off
ethtool --set-eee eno4 eee off

3. Disable PCIe ASPM

In /etc/default/grub, add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT, keeping any options already present on that line:

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"

Then:

sudo update-grub && sudo reboot
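After the reboot, it is worth confirming that the parameter actually reached the running kernel. A minimal sketch, with aspm_disabled_on_cmdline as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# aspm_disabled_on_cmdline [FILE]
# Succeeds when the kernel command line in FILE (default /proc/cmdline)
# contains pcie_aspm=off as a complete parameter.
aspm_disabled_on_cmdline() {
  local file=${1:-/proc/cmdline}
  tr ' ' '\n' < "$file" | grep -qx 'pcie_aspm=off'
}

# Typical use after rebooting:
#   aspm_disabled_on_cmdline && echo "ASPM disabled" || echo "ASPM still enabled"
```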

4. Update Firmware and BIOS

Using Dell System Update (dsu):

wget -q -O - https://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | sudo bash
sudo apt install dell-system-update
sudo dsu

5. Update Kernel (Newer igb Driver)

sudo apt install linux-generic-hwe-20.04

This installs the 5.15 HWE kernel, which carries a newer in-tree igb driver with fixes.
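Installing the package does not switch kernels until the next boot, so a quick check of the running release avoids testing against the old driver by mistake. The sketch below assumes GNU sort (for the -V option); kernel_at_least is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# kernel_at_least WANT [HAVE]
# Succeeds when kernel release HAVE (default: the output of `uname -r`)
# is WANT or newer, using version-aware ordering from GNU sort.
kernel_at_least() {
  local want=$1 have=${2:-$(uname -r)}
  [ "$(printf '%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Typical use after rebooting:
#   kernel_at_least 5.15 && echo "HWE kernel running" || echo "still on the old kernel"
```

Checking the kernel release is more reliable than checking the driver version string here, since in-tree drivers on recent kernels generally no longer carry a separate version of their own.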

6. Optional Watchdog Script

To auto-recover if the link still drops:

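#!/bin/bash
# Run as root. Every 30 seconds, send one ping to 8.8.8.8 with a 1-second
# timeout; if it goes unanswered, log the event and bounce eno1.
while true; do
  if ! ping -c1 -W1 8.8.8.8 >/dev/null 2>&1; then
    logger "NIC down, resetting eno1..."
    ip link set eno1 down
    sleep 2
    ip link set eno1 up
  fi
  sleep 30
done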

Takeaways

Power policies can masquerade as hardware or OS bugs.
Hot Spare mode halves available PSU capacity — stable for office loads, risky for bursty lab workloads.
PCIe topology changes (like installing/removing GPUs) can destabilize unrelated devices on legacy servers.
The igb driver (5.6.0-k) has long-standing instability with EEE and ASPM. Newer kernels significantly improve this.
Dell firmware updates are critical — old NIC firmware doesn’t handle modern Linux kernels well.
Power management “features” (EEE, ASPM) should be disabled in servers expected to provide reliable networking.