Dell PowerEdge R620 NIC Link Flapping After GPU Removal

September 21, 2025

Legacy servers still serve well for lab clusters, but mixing older hardware with modern workloads often surfaces subtle hardware-software interaction issues. On a Dell PowerEdge R620 running Ubuntu 20.04, I observed a repeatable NIC failure: roughly every 20 minutes the primary interface dropped connectivity, then came back after a forced event such as pressing the chassis power button.

This instability appeared after I temporarily installed an NVIDIA Tesla T4 GPU for ML workloads and later removed it.


Environment Context

  • Server: Dell PowerEdge R620

  • OS: Ubuntu 20.04 LTS (kernel 5.4, later tested with 5.15 HWE)

  • NICs: Quad-port Intel 1GbE adapters, driven by igb driver 5.6.0-k

  • Use case: Serving internet-facing APIs and running databases

  • Event timeline:

    1. Server stable before GPU installation.
    2. Installed NVIDIA T4 (PCIe Gen3 GPU).
    3. Removed T4 after testing.
    4. NICs began link flapping every ~20 minutes.

Observed Behavior

  • Network outage for a few seconds every ~20 minutes.

  • Logs showed repeated “link up” renegotiations:

    igb 0000:01:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
    
  • Pressing the chassis power button immediately restored connectivity (ACPI interrupt).

  • Kernel logs contained UFW multicast blocks (unrelated noise), and SATA “link down” events for empty ports (benign).
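
To put a number on the ~20 minute cadence, the gaps between successive "link up" messages can be computed from the kernel log. This is a minimal sketch rather than the exact procedure I used: it assumes journalctl's short-unix output, where the first field is epoch seconds, and flap_intervals is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# flap_intervals (hypothetical helper): read kernel log lines on stdin, with
# epoch seconds in the first field as printed by `journalctl -k -o short-unix`,
# and print the gap in whole seconds between successive "NIC Link is Up"
# messages for one interface.
flap_intervals() {
    local iface="${1:-eno1}"
    awk -v iface="$iface" '
        index($0, iface " NIC Link is Up") {
            if (seen) printf "%d\n", $1 - prev   # truncate to whole seconds
            prev = $1
            seen = 1
        }
    '
}
```

Usage: journalctl -k -o short-unix | flap_intervals eno1. A run of values close to 1200 would match the 20-minute pattern described above.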


Root Cause Analysis

  1. igb Driver Age

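    • Ubuntu 20.04's 5.4 kernel ships the in-tree igb driver, reported as 5.6.0-k. The -k suffix marks the in-kernel build, and the version string changes far less often than the code, so it is a poor guide to which fixes are included.
    • Known issue: link instability with igb is commonly reported when the NIC enters low-power states, especially with ASPM and EEE enabled.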
  2. Impact of GPU Install/Remove

    • Installing/removing a PCIe Gen3 GPU (NVIDIA T4) modified PCIe lane routing and ASPM configuration.
    • After removal, residual ASPM/ACPI state pushed the NICs into unstable power modes.
    • Legacy R620 firmware (2012–2013) was not designed with these devices in mind.
  3. Power Management Features

    • Energy Efficient Ethernet (EEE): enabled by default, frequently causes renegotiation on Intel igb NICs.
    • PCIe ASPM: interacts poorly with igb after PCIe topology changes, causing link dropouts.
  4. Firmware/BIOS Age

    • Without Dell BIOS + NIC firmware updates, igb stability issues are amplified.
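
Before changing anything, it helps to record which power-management settings are actually in effect. The sketch below reads the kernel's current ASPM policy from sysfs; active_aspm_policy is a hypothetical helper name, and the sysfs path is the standard location of the pcie_aspm module parameter.

```shell
#!/usr/bin/env bash
# active_aspm_policy (hypothetical helper): print the ASPM policy currently
# in use. The sysfs file lists every policy on one line and marks the active
# one with brackets, e.g. "default performance [powersave] powersupersave".
active_aspm_policy() {
    local file="${1:-/sys/module/pcie_aspm/parameters/policy}"
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

Usage: run active_aspm_policy with no arguments for the system-wide policy. For the per-device view, sudo lspci -vv -s 01:00.0 | grep -i aspm shows the link capability and control state for the first NIC port, and ethtool --show-eee eno1 reports the EEE state.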

Monitoring and Verification

To confirm the failure mode, I ran continuous checks:

# Log monitoring
sudo dmesg --follow | grep eno1

# Link state monitoring
watch -n 5 cat /sys/class/net/eno1/carrier

During drops, carrier flipped 1 → 0 → 1.
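
Because watch only shows the current value, a short drop between refreshes is easy to miss. A small polling loop that prints a timestamped line on every change leaves a record to compare against the kernel log. This is a sketch under assumptions: log_carrier_changes and its arguments are hypothetical, and one-second polling can still miss very brief flaps.

```shell
#!/usr/bin/env bash
# log_carrier_changes (hypothetical helper): poll a carrier file and print a
# UTC-timestamped line whenever its value changes, including the first read.
# Arguments: interface, poll interval in seconds, number of polls
# (0 = run until interrupted), optional path to the carrier file.
log_carrier_changes() {
    local iface="${1:-eno1}" interval="${2:-1}" count="${3:-0}"
    local file="${4:-/sys/class/net/$iface/carrier}"
    local prev="" cur i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Reading carrier fails while the interface is administratively
        # down, so record that case as "unknown" instead of aborting.
        cur="$(cat "$file" 2>/dev/null || echo unknown)"
        if [ "$cur" != "$prev" ]; then
            printf '%s %s carrier=%s\n' "$(date -u +%FT%TZ)" "$iface" "$cur"
            prev="$cur"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Usage: log_carrier_changes eno1 1 | tee -a carrier.log, left running in a tmux or screen session across a few flap cycles.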

Driver information:

ethtool -i eno1

confirmed igb 5.6.0-k.


Resolution Steps

1. Disable Energy Efficient Ethernet (EEE)

sudo ethtool --set-eee eno1 eee off
sudo ethtool --set-eee eno2 eee off
sudo ethtool --set-eee eno3 eee off
sudo ethtool --set-eee eno4 eee off

Persistent configuration via /etc/networkd-dispatcher/routable.d/disable-eee.sh (the script must be executable and owned by root, or networkd-dispatcher will skip it):

#!/bin/bash
ethtool --set-eee eno1 eee off
ethtool --set-eee eno2 eee off
ethtool --set-eee eno3 eee off
ethtool --set-eee eno4 eee off
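
Hard-coding eno1 through eno4 works on this box, but the same change can be applied to whichever ports are bound to igb by reading the driver symlink in sysfs. A minimal sketch, with igb_ifaces as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# igb_ifaces (hypothetical helper): print the name of every network
# interface whose PCI device is bound to the igb driver. Virtual interfaces
# such as lo have no device/driver link and are skipped.
igb_ifaces() {
    local root="${1:-/sys/class/net}" dev
    for dev in "$root"/*; do
        [ -e "$dev/device/driver" ] || continue
        if [ "$(basename "$(readlink -f "$dev/device/driver")")" = "igb" ]; then
            basename "$dev"
        fi
    done
}
```

Usage: for i in $(igb_ifaces); do sudo ethtool --set-eee "$i" eee off; done. Afterwards, ethtool --show-eee on each port should report EEE as disabled.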

2. Disable PCIe ASPM

In /etc/default/grub, add pcie_aspm=off to the kernel command line, keeping any options already on that line:

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"

Then:

sudo update-grub && sudo reboot
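
After the reboot, it is worth confirming that the parameter actually reached the kernel, since a typo in /etc/default/grub fails silently. A small sketch that checks the boot command line for an exact match; has_kernel_param is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# has_kernel_param (hypothetical helper): succeed if the given parameter
# appears as a whole word on the kernel command line.
has_kernel_param() {
    local param="$1" file="${2:-/proc/cmdline}"
    # One word per line, then require an exact, fixed-string, full-line match.
    tr -s ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

Usage: has_kernel_param pcie_aspm=off && echo "ASPM disabled at boot". The kernel log typically confirms this as well; dmesg | grep -i aspm should show a line saying PCIe ASPM is disabled.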

3. Update Firmware and BIOS

Using Dell System Update (dsu):

wget -q -O - https://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | sudo bash
sudo apt install dell-system-update
sudo dsu

4. Update Kernel (Newer igb Driver)

sudo apt install linux-generic-hwe-20.04

After a reboot, this brings in kernel 5.15 and its newer in-tree igb driver.
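
Installing the HWE metapackage does not switch the running kernel until the machine reboots, and GRUB can still default to an older entry. A quick sketch for checking the running version against a minimum, assuming GNU sort with -V support; kernel_at_least is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# kernel_at_least (hypothetical helper): succeed if the running kernel
# (or the version given as a second argument) is at least the wanted one.
# Relies on GNU sort -V for version ordering.
kernel_at_least() {
    local want="$1" have="${2:-$(uname -r)}"
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}
```

Usage: kernel_at_least 5.15 || echo "still on $(uname -r), reboot into the HWE kernel".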

5. Optional Watchdog Script

To auto-recover if the link still drops:

while true; do
    if ! ping -c1 -W1 8.8.8.8 >/dev/null; then
        logger "NIC down, resetting eno1..."
        ip link set eno1 down
        sleep 2
        ip link set eno1 up
    fi
    sleep 30
done

Technical Takeaways

  • PCIe topology changes (like installing/removing GPUs) can destabilize unrelated devices on legacy servers.
  • igb driver (5.6.0-k) has long-standing instability with EEE and ASPM. Newer kernels significantly improve this.
  • Dell firmware updates are critical: BIOS and NIC firmware from the R620's original release era can behave poorly under modern Linux kernels.
  • Power management “features” (EEE, ASPM) are worth disabling on servers where link stability matters more than power savings.
