Fixing Dell PowerEdge R620 Freezes: PSU Hot Spare, Power Cap Policy, and Power Budget Analysis

September 21, 2025

Legacy servers still serve well for lab clusters, but mixing older hardware with modern workloads often surfaces subtle hardware–software interaction issues. On a Dell PowerEdge R620 running Ubuntu 20.04, I observed a repeatable NIC failure: roughly every 20 minutes, the primary interface dropped connectivity, then came back after a forced event (such as pressing the power button).

This instability appeared after I temporarily installed an NVIDIA Tesla T4 GPU for ML workloads and later removed it.


Environment Context

  • Server: Dell PowerEdge R620

  • OS: Ubuntu 20.04 LTS (kernel 5.4, later tested with 5.15 HWE)

  • NICs: Quad-port Intel 1GbE adapters, driven by igb driver 5.6.0-k

  • Use case: Serving internet-facing APIs and running databases

  • Event timeline:

    1. Server stable before GPU installation.
    2. Installed NVIDIA T4 (PCIe Gen3 GPU).
    3. Removed T4 after testing.
    4. NICs began link flapping every ~20 minutes.

Observed Behavior

  • Network outage for a few seconds every ~20 minutes.

  • Logs showed repeated “link up” renegotiations:

    igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
  • Pressing the chassis power button immediately restored connectivity, presumably because the resulting ACPI interrupt woke the device.

  • Kernel logs contained UFW multicast blocks (unrelated noise), and SATA “link down” events for empty ports (benign).


Step 1: Understanding the Server Power Budget

Every server component consumes power differently:

Component                    | Typical Draw (R620 class)             | Notes
-----------------------------+---------------------------------------+-------------------------------------------------
CPUs (2x Xeon E5-26xx v2)    | 95–130W each                          | Peak 260W combined
Memory (24x DDR3 ECC DIMMs)  | 3–5W idle, 8–12W loaded → up to 200W  | Fully loaded chassis burns a lot of watts in RAM
Fans (6 high-speed blowers)  | 10–20W each → 60–120W                 | Draw increases if GPU or ambient temp is high
NICs (onboard Intel 1GbE)    | 5–10W total                           | Small average load, but burst-sensitive
RAID/HBA (PERC H710)         | 12–15W                                | Constant baseline
GPU (Tesla T4)               | ~70W steady, higher bursts            | Briefly installed, altered PCIe topology
Misc (drives, backplane)     | 20–40W                                | Adds to baseline

Who dominates?

  • Steady state: CPUs and memory are the biggest consumers.
  • Burst load: PCIe devices (NICs, GPUs, HBAs) can cause sudden spikes, even if their average power is low.
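
As a rough sanity check, the upper bounds from the table can be added up and compared with a single PSU. This is only a sketch: the wattages are the estimates above rather than measurements, the 750W figure is the example PSU size used later in this post, and the helper names sum_watts and headroom are made up for illustration.

```shell
#!/bin/sh
# Rough worst-case power budget using the upper bounds from the table.
# These are the post's estimates, not measured values.

# sum_watts: add up any number of integer wattages.
sum_watts() {
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    echo "$total"
}

# headroom CAPACITY LOAD: capacity minus load (negative means over budget).
headroom() {
    echo $(( $1 - $2 ))
}

# CPUs 260, memory 200, fans 120, NICs 10, RAID 15, GPU 70, misc 40
peak=$(sum_watts 260 200 120 10 15 70 40)
echo "Estimated peak draw: ${peak}W"                          # 715W
echo "Headroom on one 750W PSU: $(headroom 750 "$peak")W"     # 35W
```

With every component at its upper bound and the T4 installed, a single 750W supply is left with only a few tens of watts of margin, which is why the PSU policy matters in the next steps.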

Step 2: PSU Hot Spare Behavior

Dell servers with dual PSUs allow two policies:

  • Hot Spare = Enabled

    • One PSU is active, the other idle.
    • Server runs at half of total PSU capacity (e.g., 750W of 1500W).
    • Any transient spike (NIC burst, PCIe lane wakeup, GPU ramp) must be absorbed by that single PSU.
  • Hot Spare = Disabled (Redundant Mode)

    • Both PSUs are active and load-share.
    • Still redundant: if one fails, the other instantly ramps up.
    • System has full headroom for bursts, reducing risk of power droop.

Step 3: Power Cap Policy Interaction

Dell iDRAC/BIOS allows enforcing a system-wide power cap (e.g., 450W).

  • With Hot Spare enabled, that cap applies against a single PSU.
  • This artificially limits headroom even further.
  • Result: burst demand from PCIe or NICs → cap enforced → droop → system freeze.
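
The interaction between Hot Spare and the power cap, as described in Steps 2 and 3, can be written down as a small model. This is a simplification of the post's reasoning and not Dell's actual firmware logic; effective_limit is a hypothetical helper, and two identical PSUs are assumed.

```shell
#!/bin/sh
# Simplified model of Steps 2 and 3, not Dell's firmware behavior.
# effective_limit PSU_WATTS HOT_SPARE CAP_WATTS
#   PSU_WATTS  rated capacity of one PSU (two identical PSUs assumed)
#   HOT_SPARE  "on" or "off"
#   CAP_WATTS  system power cap in watts, or 0 for no cap
effective_limit() {
    psu_watts=$1
    hot_spare=$2
    cap=$3
    if [ "$hot_spare" = "on" ]; then
        limit=$psu_watts            # only one PSU carries the load
    else
        limit=$((psu_watts * 2))    # both PSUs share the load
    fi
    # A configured cap lowers the limit when it is the tighter bound.
    if [ "$cap" -gt 0 ] && [ "$cap" -lt "$limit" ]; then
        limit=$cap
    fi
    echo "$limit"
}

effective_limit 750 on 450     # prints 450
effective_limit 750 on 0       # prints 750
effective_limit 750 off 0      # prints 1500
```

Under this model, the worst case is Hot Spare on with a 450W cap: the usable limit falls well below the estimated peak draw of the chassis.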

Step 4: Why NICs Showed the Problem First

Even though CPUs and RAM consume more overall, NICs and PCIe cards are burst-sensitive:

  • When a NIC flushes TX/RX queues under high traffic, it can spike power in milliseconds.
  • With only one PSU active (Hot Spare), those bursts starved PCIe lanes.
  • The NIC link reset (NIC Link is Up) was the symptom of transient droop.
  • As bursts increased, the entire PCIe bus destabilized → OS freeze.

Root Cause Analysis

  1. igb Driver Age

    • Ubuntu 20.04 ships with igb 5.6.0-k, dating back to ~2014.
    • Known issue: the driver mishandles low-power states, especially with ASPM and EEE enabled.
  2. Impact of GPU Install/Remove

    • Installing/removing a PCIe Gen3 GPU (NVIDIA T4) modified PCIe lane routing and ASPM configuration.
    • After removal, residual ASPM/ACPI state pushed the NICs into unstable power modes.
    • Legacy R620 firmware (2012–2013) was not designed with these devices in mind.
  3. Power Management Features

    • Energy Efficient Ethernet (EEE): enabled by default, frequently causes renegotiation on Intel igb NICs.
    • PCIe ASPM: interacts poorly with igb after PCIe topology changes, causing link dropouts.
  4. Firmware/BIOS Age

    • Without Dell BIOS + NIC firmware updates, igb stability issues are amplified.
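
Before changing anything, it helps to see which power management features are actually active. The sketch below reads the kernel's ASPM policy list, where the active entry is shown in square brackets; active_aspm_policy is a hypothetical helper name, and the sysfs file may be absent on kernels built without ASPM support.

```shell
#!/bin/sh
# active_aspm_policy: read a policy list on stdin and print the
# bracketed (currently active) entry.
active_aspm_policy() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

# Usage on a live system:
#   active_aspm_policy < /sys/module/pcie_aspm/parameters/policy
#   ethtool --show-eee eno1    # reports whether EEE is enabled on eno1
```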

Monitoring and Verification

To confirm the failure mode, I ran continuous checks:

# Log monitoring
sudo dmesg --follow | grep eno1

# Link state monitoring
watch -n 5 cat /sys/class/net/eno1/carrier

During drops, carrier flipped 1 → 0 → 1.
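
To check whether the drops really follow a ~20 minute cadence, the gaps between consecutive "NIC Link is Up" messages can be computed from the kernel log. This is a minimal sketch that assumes the default dmesg timestamp format (seconds since boot in square brackets); flap_intervals is a hypothetical helper name.

```shell
#!/bin/sh
# flap_intervals: read kernel log lines on stdin and print the number of
# seconds between consecutive "NIC Link is Up" messages.
flap_intervals() {
    awk '/NIC Link is Up/ {
        # The timestamp is the number inside the leading [ ... ].
        ts = $0
        sub(/^\[ */, "", ts)
        sub(/\].*/, "", ts)
        if (seen) printf "%d\n", ts - prev
        prev = ts
        seen = 1
    }'
}

# Usage on a live system:
#   sudo dmesg | grep eno1 | flap_intervals
```

Values clustering around 1200 would match the 20-minute pattern; irregular gaps would point to a different trigger.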

Driver information:

ethtool -i eno1

confirmed igb 5.6.0-k.


Resolution Steps

1. PSU Settings (BIOS/iDRAC)

  • Disable Hot Spare.
  • Keep Redundancy = Redundant.
  • Disable System Power Cap Policy.
  • Set Performance Mode for PCIe power.

2. Disable Energy Efficient Ethernet (EEE)

sudo ethtool --set-eee eno1 eee off
sudo ethtool --set-eee eno2 eee off
sudo ethtool --set-eee eno3 eee off
sudo ethtool --set-eee eno4 eee off

Persistent configuration via /etc/networkd-dispatcher/routable.d/disable-eee.sh (the script must be executable, e.g. chmod +x, for networkd-dispatcher to run it):

#!/bin/bash
ethtool --set-eee eno1 eee off
ethtool --set-eee eno2 eee off
ethtool --set-eee eno3 eee off
ethtool --set-eee eno4 eee off
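
If the interface names differ between machines, the hardcoded list can be replaced by discovering every interface bound to igb through sysfs. This is a sketch, not what I originally deployed: igb_ifaces and disable_eee_all are hypothetical helper names, and the optional directory argument exists only to make the lookup easy to try against a test tree.

```shell
#!/bin/sh
# igb_ifaces [SYSFS_NET_DIR]: list interfaces whose driver symlink
# points at igb. Defaults to /sys/class/net.
igb_ifaces() {
    net_dir=${1:-/sys/class/net}
    for dev in "$net_dir"/*; do
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$dev/device/driver" ] || continue
        drv=$(basename "$(readlink "$dev/device/driver")")
        [ "$drv" = "igb" ] && basename "$dev"
    done
    return 0
}

# disable_eee_all: turn EEE off on every igb interface (needs root).
disable_eee_all() {
    for ifc in $(igb_ifaces); do
        ethtool --set-eee "$ifc" eee off
    done
}
```

Calling disable_eee_all at the end of the dispatcher script then covers all four ports without naming them.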

3. Disable PCIe ASPM

In /etc/default/grub, add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT, keeping any parameters already set there:

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"

Then:

sudo update-grub && sudo reboot
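
After the reboot, it is worth confirming that the parameter actually reached the kernel. A minimal check, assuming the usual /proc/cmdline location; has_boot_param is a hypothetical helper name, and its optional second argument only exists so the function can be pointed at a different file.

```shell
#!/bin/sh
# has_boot_param PARAM [CMDLINE_FILE]: succeed if PARAM appears as a
# whole word on the kernel command line.
has_boot_param() {
    param=$1
    file=${2:-/proc/cmdline}
    for word in $(cat "$file"); do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Usage on a live system:
#   has_boot_param pcie_aspm=off && echo "ASPM disabled at boot"
```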

4. Update Firmware and BIOS

Using Dell System Update (dsu):

wget -q -O - https://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | sudo bash
sudo apt install dell-system-update
sudo dsu

5. Update Kernel (Newer igb Driver)

sudo apt install linux-generic-hwe-20.04

Brings kernel 5.15 with igb driver fixes.

6. Optional Watchdog Script

To auto-recover if the link still drops:

while true; do
  if ! ping -c1 -W1 8.8.8.8 >/dev/null; then
    logger "NIC down, resetting eno1..."
    ip link set eno1 down
    sleep 2
    ip link set eno1 up
  fi
  sleep 30
done

Takeaways

  • Power policies can masquerade as hardware or OS bugs.
  • Hot Spare mode halves available PSU capacity: stable for office loads, risky for bursty lab workloads.
  • PCIe topology changes (like installing/removing GPUs) can destabilize unrelated devices on legacy servers.
  • The igb driver (5.6.0-k) has long-standing instability with EEE and ASPM. Newer kernels significantly improve this.
  • Dell firmware updates are critical: old NIC firmware doesn’t handle modern Linux kernels well.
  • Power management “features” (EEE, ASPM) should be disabled in servers expected to provide reliable networking.
