Server Failure Survival Guide: How I Recovered Data and Rebuilt My Dell PowerEdge the Right Way

November 2, 2025

There’s a certain kind of silence you never want to hear from your server — that sudden pause in the fans, the blinking LEDs going dark, and the sinking realization that something you took for granted just died.

This is the story of how my Dell PowerEdge R620 (and later, its successor R640) pushed me into one of the most humbling engineering experiences I’ve ever had: 🔥 a motherboard failure, 💥 a RAID-5 data recovery mission, and 🧩 a crash course in system resilience.

If you’re an engineer, a tinkerer, or anyone who builds their own servers, I hope this serves as both a warning and a manual.


Part 1: When Curiosity Met Power — Installing the Tesla T4

It started with excitement. I had just gotten my hands on an NVIDIA Tesla T4 GPU, a low-profile, power-efficient accelerator that’s perfect for inference workloads, LLM testing, and real-time AI applications.

The Dell PowerEdge R620 was sitting on my desk — a veteran 1U server with twin Intel Xeon processors and 495 W power supplies. “It should be fine,” I thought. “The T4 only draws around 70 W.”

What I didn’t know was that the R620’s motherboard wasn’t designed to deliver that kind of sustained PCIe power draw across its riser board. Within seconds of booting up with the T4 installed, the fans spun up to jet-engine speed, the system lights flashed amber, and then… silence.

The motherboard fried. No BIOS beep. No iDRAC. Just a solid, expensive lesson.

⚠️ Lesson #1 – Don’t Install a Tesla T4 in an R620

The R620 doesn’t provide enough dedicated PCIe slot power for modern GPUs. Even if it fits physically, it’s electrically unsafe. The Tesla T4 requires a modern platform with proper power allocation and cooling — such as the R640 or R740, but even then, you need ≥ 750 W PSUs.

So yes, I learned the hard way:

If you’re adding a GPU, always check your PSU capacity, PCIe lane design, and power headroom.
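
Had I run a quick pre-flight check, the numbers would have told me there was no headroom to begin with. A minimal sketch, assuming a platform that actually supports the card and that dmidecode and the NVIDIA driver are installed:

# PSU capacity as reported by the system board (DMI type 39 = System Power Supply)
sudo dmidecode --type 39 | grep -iE 'name|capacity|status'

# Is the card detected, and on which PCIe bus?
lspci | grep -i nvidia

# Actual draw vs. the board's power limit, once the driver is loaded
nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv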


Part 2: The Silent Killer — RAID-5

After replacing the R620 with an R640, I thought I was safe. I re-used the same set of four SAS drives that were configured in RAID 5.

At first, everything looked great. Ubuntu booted up cleanly, my partitions mounted, and my files were all there. I congratulated myself — too early.

About 20 minutes later, while uploading files to S3, the system froze. Disk I/O stopped responding. Logs filled with “I/O error, device sda” and “Remounting filesystem read-only.” The RAID was failing under load.

I rebooted. It worked again for a few minutes, then failed again. Classic degraded array behavior — the RAID controller was trying to rebuild or read from missing stripes, and the filesystem responded by going read-only.
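
In hindsight, a couple of commands would have confirmed the diagnosis before the next freeze. A rough sketch, assuming smartmontools is installed and the drives are visible to the OS as /dev/sdX (behind a PERC controller you may need smartctl's -d megaraid,N device type):

# Kernel-level evidence of the failing array
sudo dmesg -T | grep -iE 'i/o error|read-only'

# SMART health and error counters for the suspect drive
sudo smartctl -H -A /dev/sda

# What the OS currently sees, and where it is mounted
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT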

🧩 Lesson #2 – RAID-5 is Fast but Fragile

RAID-5 sounds attractive — performance with redundancy. But in practice:

  • It tolerates only one disk failure.
  • During a rebuild, the array is at its most vulnerable — a single unrecoverable read error on a remaining drive and the whole array is lost.
  • And worst of all: RAID-5 arrays are hard to clone or migrate to a new system, because the data is striped across drives with controller-specific metadata.

If you need true reliability and easy cloning, use:

  • RAID 1 (mirror) → simple, durable, easy to recover.
  • RAID 10 (striped mirrors) → best balance between speed and redundancy.

RAID-5 is good for lab experiments, not critical workloads.


Part 3: The Panic Phase — Data Rescue the Naive Way

When you realize your array is dying, your instincts take over. Mine said, “Get the data out, now.”

The first idea was to sync everything to S3 using:

aws s3 sync /mnt/data s3://ohwise-backup/

That was a mistake.

S3 uploads are extremely I/O intensive — multiple threads reading chunks, checksumming, retrying, and writing logs. The already-unstable disks couldn’t handle it. The filesystem locked up again.
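
If you truly have no choice but to push straight to S3 from a sick disk, you can at least throttle the CLI first. A hedged sketch using the AWS CLI's S3 transfer settings:

# One request at a time instead of the default ten
aws configure set default.s3.max_concurrent_requests 1
# Keep the internal transfer queue small
aws configure set default.s3.max_queue_size 100

aws s3 sync /mnt/data s3://ohwise-backup/ --only-show-errors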

This was the moment I learned one of the most counterintuitive truths of system rescue:

When your disks are failing, simplicity saves data.

So instead, I grabbed a USB drive and went old school.


Part 4: Mounting the Lifeboat

First, I plugged in a new 32 GB USB stick and confirmed it was recognized:

lsblk

Then, I created a mount point and mounted it:

sudo mkdir -p /mnt/usb
sudo mount /dev/sdb1 /mnt/usb

Next, I used the safest file-copy tool in Linux, rsync:

sudo rsync -aHAXv --progress /mnt/data/backups /mnt/usb/

rsync preserves permissions, timestamps, and symbolic links while avoiding unnecessary writes. With --partial it can resume an interrupted copy instead of starting over, and it doesn’t hammer the disk the way a multi-threaded upload does.

Every 5 GB felt like pulling treasure from a sinking ship.

When I saw the message:

sent 10,240,000 bytes  received 1,024 bytes  2.0MB/s
total size is 10,240,000  speedup is 1.00

I exhaled.

⚙️ Lesson #3 – When Disaster Strikes, Go Naive, Not Clever

  • Don’t use S3 sync on dying disks — it multiplies read load.

  • Use rsync with --progress and --partial (dd uses status=progress instead) for safe, resumable copying — see the sketch after this list.

  • If needed, split large files before moving them to FAT-formatted USB drives:

    split -b 3G bigfile.sql bigfile_part_
    cat bigfile_part_* > bigfile.sql   # to restore
  • And always gracefully unmount after copying:

    sudo umount /mnt/usb
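
Putting those flags together, a throttled, resumable copy looks roughly like this (the --bwlimit value is an arbitrary example in KB/s; tune it to what the dying disk tolerates):

# --partial keeps half-copied files so an interrupted run resumes instead of restarting
sudo rsync -aHAXv --progress --partial --bwlimit=20000 /mnt/data/backups/ /mnt/usb/backups/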

It’s slow, but it’s the kind of slowness that saves.


Part 5: Cloning with dd and the Art of Patience

After rescuing the essentials, I tried to clone the whole disk to a new 2 TB SAS drive:

sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync status=progress

At first, it flew at 200 MB/s. Then it dropped to 80 MB/s. Then 50 MB/s. Then 48 MB/s.

Disk cloning speed isn’t constant — it depends on:

  • inner vs. outer track positions,
  • thermal throttling,
  • background error correction,
  • and the kernel’s I/O scheduling.

But that’s okay. Because dd with conv=noerror,sync keeps going even when read errors occur, filling unreadable sectors with zeros.

Once it finished, I ran:

sudo sync

to flush all pending writes.

⚠️ Never skip sync after cloning. It flushes every buffered write to the new disk before you remove or remount it.
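
Before trusting the clone, a quick sanity check is worth the extra minute. A minimal sketch, assuming both disks are still attached as /dev/sda and /dev/sdb:

# The clone should report at least the same size in bytes as the source
sudo blockdev --getsize64 /dev/sda /dev/sdb

# The clone should show the same partitions and filesystems
lsblk -f /dev/sdb

# Ask the kernel to re-read the clone's partition table if it hasn't noticed it yet
sudo partprobe /dev/sdb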


Part 6: When dd Isn’t Enough — Meet ddrescue

If your disk fails mid-way, dd can’t resume from where it left off. That’s where ddrescue shines:

sudo apt install gddrescue
sudo ddrescue -f -n /dev/sda /dev/sdb rescue.log

It records which blocks have been copied in a map file (rescue.log above), so if you reboot or retry, it picks up right where it left off.
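
The -n flag above skips the slow scraping phase to grab the easy data first; the usual follow-up is a second pass that retries only the bad areas, resuming from the same map file — a sketch of that second pass:

# Retry bad sectors up to 3 times with direct disc access, picking up from rescue.log
sudo ddrescue -d -f -r3 /dev/sda /dev/sdb rescue.log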

🧠 Lesson #4 – Always Keep a ddrescue USB Handy

A ddrescue bootable USB is your friend when a disk goes unstable. It’s the kind of tool you hope you never need — until you do.
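
Building one takes a few minutes while everything is still healthy. A hedged sketch, assuming you’ve downloaded a live ISO that ships ddrescue (SystemRescue does) and that the stick shows up as /dev/sdX:

# Confirm which device is the USB stick — the dd below overwrites it completely
lsblk

# Write the hybrid ISO straight to the stick (replace the filename and /dev/sdX)
sudo dd if=systemrescue.iso of=/dev/sdX bs=4M status=progress conv=fsync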


Part 7: The Mount Command That Saved My Data

After cloning, I needed to verify what was on the new drive. Simple:

sudo mkdir /mnt/newdisk
sudo mount /dev/sdb3 /mnt/newdisk
ls /mnt/newdisk

If you see your files, the clone worked. If you see an empty directory, the partition table might need fixing:

sudo fdisk -l /dev/sdb

That moment when you open the directory and see your familiar folder names — /home, /mnt/data, /var/lib/mysql — it’s like breathing again after holding your breath underwater.


Part 8: MySQL and the Mount Point Lesson

One of my most valuable folders was /var/lib/mysql. That’s where all MariaDB data lived.

But here’s a subtle trap: When you mount a new drive onto /var/lib/mysql, the underlying folder is masked. If the mount fails, your database silently writes to the old folder under /var instead — leading to massive confusion.

The correct way is:

sudo systemctl stop mariadb
sudo rsync -aHAXv /var/lib/mysql/ /mnt/data/mysql/
sudo mv /var/lib/mysql /var/lib/mysql.old   # keep the original as a fallback
sudo ln -s /mnt/data/mysql /var/lib/mysql
sudo systemctl start mariadb

If used correctly, symlinks like /var/lib/mysql → /mnt/data/mysql separate data from the OS. If used incorrectly, they can cause silent data loss. Always test your mounts before restarting services.
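
A few one-liners make that test concrete before MariaDB comes back up — a small sketch using standard util-linux tools:

# Is /mnt/data actually a mounted filesystem, not just an empty directory?
findmnt /mnt/data

# Where does the symlink really point?
readlink -f /var/lib/mysql

# Are the tables and ownership what MariaDB expects?
ls -l /var/lib/mysql/ | head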


Part 9: AWS S3 Command-Line Cheatsheet

Once the local recovery was stable, I focused on moving backups between my server and the cloud. Amazon S3 is perfect for long-term, redundant storage — but only if you master its CLI workflow.

Below is the condensed cheatsheet I now keep pinned inside my terminal notes.


🧩 Setup

sudo apt install -y awscli
aws configure

You’ll be prompted for:

AWS Access Key ID: <your key>
AWS Secret Access Key: <your secret>
Default region name [us-east-1]:
Default output format [json]:

Credentials are stored in ~/.aws/credentials.


☁️ Basic Upload / Download

Action | Command | Notes
Upload one file | aws s3 cp myfile.sql s3://my-bucket/backups/ | Copies to the bucket
Download one file | aws s3 cp s3://my-bucket/backups/myfile.sql . | Pulls back to local
Upload a folder recursively | aws s3 cp /mnt/data/backups/ s3://my-bucket/backups/ --recursive | Mirrors the entire folder
Download a folder recursively | aws s3 cp s3://my-bucket/backups/ /mnt/data/backups/ --recursive | Useful for full restores

🔄 Syncing Two Folders

Keeps two directories in sync — uploads only new or changed files.

aws s3 sync /mnt/data/backups/ s3://my-bucket/backups/

To pull from the cloud:

aws s3 sync s3://my-bucket/backups/ /mnt/data/backups/

Options:

  • --delete → removes files from the destination that are no longer in the source
  • --exclude "*.tmp" → skip temporary files
  • --dryrun → preview actions without executing
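
Combined, a safe preview of a backup run looks like this:

# Show what would be uploaded, skipping temp files, without transferring anything
aws s3 sync /mnt/data/backups/ s3://my-bucket/backups/ --exclude "*.tmp" --dryrun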

🧰 Other Handy Commands

# List all buckets
aws s3 ls

# List files in a bucket
aws s3 ls s3://my-bucket/

# Check an object’s metadata
aws s3api head-object --bucket my-bucket --key backups/myfile.sql

# Remove an object
aws s3 rm s3://my-bucket/backups/oldfile.sql

🎯 Lesson #5 – Always Keep an Offline Copy of Your Recovery Tools

Internet access is not guaranteed during recovery. Having the tools locally — rsync, gdown, ddrescue, smartctl, lsblk, and fdisk — can make or break your rescue plan.
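
On Debian/Ubuntu, most of that kit is a single install while the network still works (lsblk and fdisk already ship with util-linux; the package names below are the Debian ones):

sudo apt install -y rsync gddrescue smartmontools
pip install gdown   # only useful while you still have a network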


Part 10: Rediscovering RAID

After rebuilding the system, I decided to revisit RAID from the ground up. Here’s what I learned, simplified:

RAID Level | Drives | Fault Tolerance | Performance | Notes
RAID 0 | ≥ 2 | ❌ None | 🚀 Fast | Stripes data, no redundancy. One failure = total loss.
RAID 1 | 2 | ✅ 1 | ⚙️ Moderate | Perfect mirror. Simple and reliable.
RAID 5 | ≥ 3 | ✅ 1 | ⚡ Good | Striped with parity. Rebuilds are risky.
RAID 6 | ≥ 4 | ✅ 2 | ⚙️ Slower | Safer than RAID 5, better for large arrays.
RAID 10 | ≥ 4 | ✅ 1 per pair | 🚀 Fast & Safe | Striped mirrors. Best for DB and LLM workloads.

For my setup — two 2 TB SAS drives — I chose RAID 1.

It’s slower than RAID 0 but gives me peace of mind: if one drive dies, the other keeps running. And unlike RAID 5, I can clone or replace a single drive easily using:

sudo dd if=/dev/sda of=/dev/sdb bs=64K conv=noerror,sync status=progress

No controller dependency, no hidden striping logic — just clean mirroring.
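
If you build the mirror in software instead of on the PERC controller, mdadm keeps it just as controller-independent. A minimal sketch, assuming two blank drives at /dev/sda and /dev/sdb (this destroys whatever is on them):

# Create the mirror
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Watch the initial sync
cat /proc/mdstat

# Persist the array so it assembles on boot
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u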


Part 11: Designing the Final Setup

Now that everything was rebuilt, I redesigned my disk layout for clarity, scalability, and safety.

Mount Point | Size | Purpose
/boot/efi | 10 GB | Boot partition
/ | 600 GB | OS + tools + containers
/mnt/data | 1.2 TB | Application data, MySQL, Ollama models, backups

📦 Why This Layout Works

  • / remains small and easy to back up or image.
  • /mnt/data lives on a separate partition (or RAID 1 mirror), making it safe to wipe the OS without touching data.
  • Databases, LLM weights, and S3 syncs all live under /mnt/data.

If Kubernetes or K3s uses local persistent volumes, I can just bind-mount /mnt/data and still maintain disk isolation.
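
Concretely, that is one fstab entry for the data partition plus, if K3s wants its own path, a bind mount on top — a hedged sketch with a placeholder UUID (K3s keeps its state under /var/lib/rancher/k3s by default):

# /etc/fstab
UUID=<uuid-of-data-partition>  /mnt/data              ext4  defaults,nofail  0  2
/mnt/data/k3s                  /var/lib/rancher/k3s   none  bind             0  0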


Part 12: Reflection — What Hardware Teaches You About Software

Before this, I thought of servers as stable foundations. Now I realize — hardware is as fragile as code. And resilience isn’t just about redundancy; it’s about recovery simplicity.

Here’s what this experience permanently taught me:

  1. Don’t assume compatibility — check GPU, PSU, and motherboard specs before installing hardware.
  2. RAID ≠ Backup. Even mirrored arrays can fail catastrophically. Always keep cold copies.
  3. Use rsync, not panic. When things go wrong, slower tools are safer tools.
  4. Always mount intentionally. One wrong mount can bury data in plain sight.
  5. Keep offline utilities. Network tools can fail when you need them most.
  6. Never rely on S3 for emergency uploads. Disk I/O is sacred when recovery is in progress.
  7. Plan for human error. Because panic, not failure, causes most data loss.
