TrueNAS reboot loop, VM load, and the NVMe that would not stay seated

Overview

This is the build-and-break log for bringing a previously unstable TrueNAS/FreeBSD box back to a usable state after months of ugly crash behavior under VM load.

The short version:

  • Salvaged hardware repurposed as a NAS
  • A Fedora VM running Plex-related work could trigger a reboot loop
  • Early on, the machine was already rebooting itself in a loop — even before the NVMe path failure showed up clearly in logs
  • Hard evidence eventually pointed at one NVMe path (nvme2 / nda2) timing out and detaching under load
  • Reseating the suspect SN770 and replacing its heatsink brought the mirror back online
  • Current best hypothesis: physical seating, thermal contact, or mechanical instability — not a software bug

Environment

Storage Configuration

Mirror vdev (data volume):

logic-pool
 └── mirror-0
      ├── nda0p2  (WD_BLACK SN770 2TB)
      └── nda2p2  (WD_BLACK SN770 2TB)

Boot pool:

zroot
└── nda1p2   (Crucial P1 1TB)

Bulk storage pool:

spinning-pool
└── raidz2-0
    ├── ada0p2   (18 TB Seagate SATA)
    ├── ada1p2   (18 TB Seagate SATA)
    ├── ada2p2   (18 TB Seagate SATA)
    └── ada3p2   (18 TB Seagate SATA)

The original boot pool lived on a USB drive, and the Crucial SSD was used as a ZFS L2ARC cache device.


  • Host OS: FreeBSD replacing a prior TrueNAS SCALE install
  • Main pool: logic-pool
  • Secondary pool: spinning-pool
  • VM workload: fedora39
  • Virtualization: bhyve / vm-bhyve
  • Root symptom: VM activity — especially Plex-related access or media load — could destabilize the machine badly enough to trigger a reboot loop
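For context, a zvol-backed vm-bhyve guest is described by a small config file. The fragment below is a hypothetical reconstruction, not the actual fedora39 config from this box; the switch name, CPU count, and memory size are assumptions:

```
# Hypothetical /vm/fedora39/fedora39.conf (vm-bhyve syntax; values assumed)
loader="uefi"
cpu=4
memory=8G
network0_type="virtio-net"
network0_switch="public"
disk0_type="virtio-blk"
disk0_name="disk0"
disk0_dev="sparse-zvol"   # back the disk with a sparse zvol under the datastore
```

The relevant detail for this story is `disk0_dev`: with a zvol-backed disk, every burst of guest I/O lands directly on the pool's NVMe mirror.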

Relevant storage devices at the time of investigation:

  • nda0: WD_BLACK SN770 2TB, serial 0000000009
  • nda2: WD_BLACK SN770 2TB, serial 0000000006
  • nda1: Crucial P1 1TB, serial 0000000004
  • four 18 TB SATA disks in spinning-pool

Historical Context

This was not a single clean failure, and it was not a single-platform machine either. The box was a rescued workstation-class machine that had already been written off as obsolete and was headed for the trash. Instead of letting it die, I turned it into a NAS and kept evolving it through CORE, SCALE, and eventually bare FreeBSD.

The same box went through three distinct operating phases:

  • TrueNAS CORE during the first week of the build
  • TrueNAS SCALE once I started chasing GPU support
  • bare FreeBSD with the old storage imported and the old app state rebuilt as jails

The hardware matters here because this was never a purpose-built NAS chassis with server parts. The machine was built around:

  • a Gigabyte Z390 AORUS ULTRA motherboard
  • an Intel Core i9-9900K
  • 64 GB (4 x 16 GB) Corsair Vengeance RGB Pro DDR4-3200, part number CMW64GX4M4C3200C16, currently running at 2133 MT/s with XMP disabled
  • an EVGA SuperNOVA 850 T2 power supply from the original workstation build
  • two 2 TB WD_BLACK SN770 NVMe drives for the mirrored logic-pool
  • one 1 TB Crucial P1 SSD that eventually became the boot drive in the FreeBSD rebuild
  • an NVIDIA GeForce GTX 1080 that was passed through to the Fedora VM during the SCALE-era Plex setup
  • an ASMedia ASM116x PCIe SATA controller that was later removed during the hardware-reduction pass
  • four 18 TB Seagate SATA drives making up spinning-pool

There was also a SATA card in the system for part of this story. It turned out not to be the root cause, but it was removed anyway — by that stage I was stripping out anything redundant or suspicious just to reduce the number of moving parts.

The RAM kit is rated for 3200 MT/s, but XMP was disabled from the start, so the machine stayed at the board’s conservative 2133 MT/s the whole time. That mattered later because memory overclocking never had to stay on the suspect list.

The same ZFS datasets survived all three OS phases, and their creation dates still preserve parts of that history. They are useful as platform markers, and they line up closely enough with hardware purchase dates to make the timeline clearer than my memory did. The dataset families that survived each era: iocage from CORE, ix-applications and later ix-apps from SCALE, and zroot plus bastille from the bare FreeBSD rebuild. The command I used to sanity-check that timeline was:

zfs list -o name,creation | sort -k2
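One caveat with that command: `creation` prints a human-readable date, so `sort -k2` orders it textually rather than chronologically. Adding `-p` makes `zfs list` emit raw epoch seconds, which a numeric sort handles correctly. A minimal sketch, demonstrated on simulated two-column output since the live pools are not available here:

```shell
# On the live system:
#   zfs list -p -o name,creation | sort -n -k2
# -p prints `creation` as epoch seconds, so -n sorts truly chronologically.
# Simulated (name, creation-epoch) output to show the sort behavior:
printf 'zroot\t1773014400\nlogic-pool\t1705017600\nlogic-pool/bastille\t1772928000\n' \
  | sort -n -k2
```

The oldest dataset (logic-pool) sorts to the top, matching the timeline below.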

The dates told a clean story:

  • the two SN770 drives were ordered on January 7, 2024
  • logic-pool was created on January 12, 2024
  • logic-pool/iocage was created on January 12, 2024 — surviving CORE-era marker on disk
  • file activity inside the old iocage jail roots shows the CORE setup was still actively in use through January 19, 2024
  • logic-pool/ix-applications appeared on January 19, 2024 — first surviving SCALE-era marker on disk
  • logic-pool/ix-apps showed up on November 9, 2024, which lines up with the later SCALE app-stack changes
  • logic-pool/bastille was created on March 8, 2026 — start of the FreeBSD jail rebuild
  • zroot was created on March 9, 2026 — the current bare FreeBSD install

The earlier CORE screenshot below is useful because it anchors logic-pool to that original phase instead of leaving the timeline purely abstract.

Creating logic-pool during the original TrueNAS CORE phase

That same early CORE period also shows the VM layout already depending on ZFS volumes inside logic-pool/VMs, which matters because the later Fedora/Plex behavior was not some brand-new SCALE-only invention. The VM storage path was already there from the CORE side.

CORE-era VM disk backed by a ZFS volume in logic-pool/VMs

The actual move from CORE to SCALE was not subtle either. The migration warning below makes the platform transition explicit instead of relying only on dataset timestamps.

TrueNAS CORE warning dialog shown before migrating the system to SCALE

Later SCALE screenshots also captured another useful detail: by then the stack was already aware of both the Intel iGPU and the NVIDIA card — GPU-assisted Plex transcoding had become part of how the box was being used.

TrueNAS SCALE showing Intel and NVIDIA GPU resources available to the app stack

Once the box was on SCALE and the Fedora VM plus Plex became part of normal use, the machine developed the kind of instability that makes you question every layer at once. VM activity would start, Plex-related load would kick in, and the system would destabilize — sometimes badly enough to reboot-loop or become completely unusable.

That long tail of instability led to several hardware simplification steps before the final NVMe diagnosis:

  • the video card was removed weeks earlier
  • the SATA card was removed
  • the USB drive was removed
  • RAM was reseated
  • the L2ARC device was removed from that role
  • the former L2ARC SSD was repurposed as the boot drive
  • the storage path was simplified by connecting drives more directly instead of through extra hardware where possible

None of those changes stopped the reboot-loop issue.

Eventually TrueNAS SCALE failed in a more obvious way: instead of coming up normally, it reached GRUB and just sat there. At that point the system was no longer in the category of “annoying but still serviceable.” It was broken enough that rebuilding the platform made more sense than pretending another restart would reveal something new.

Rebuilding on FreeBSD

Once the new FreeBSD install was in place, the existing ZFS pools were imported rather than recreated:

zpool import
zpool import logic-pool
zpool import spinning-pool

The pools were not destroyed. They were re-imported from existing ZFS metadata on disk.

The next problem was reconstructing the application layout that had existed on TrueNAS SCALE — and that part was not rebuilt from memory. It was reconstructed from two sources:

  • the SCALE backup database (freenas-v1.db)
  • the old ix-applications dataset on logic-pool

The database gave me the system-side clues. The old app dataset still held the per-app state, configs, and databases. That combination made it possible to recover the stack instead of rebuilding every service blind. The practical extraction step was to inspect what the old app dataset still held:

find /logic-pool/ix-applications/releases -maxdepth 3 -type f | head -n 200

From there, the old app state was copied out of the SCALE dataset and rebuilt as BastilleBSD jails on FreeBSD, eventually producing working replacements for the core services again.
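The copy-out step amounts to a recursive copy that preserves permissions. A sketch, demonstrated on temporary directories because the real source was a path under /logic-pool/ix-applications/releases (the specific release names are not reproduced here):

```shell
# Stand-in demo for the copy-out step. On the real box the source lived
# under /logic-pool/ix-applications/releases/ and the destination was the
# new jail's data directory.
SRC=$(mktemp -d)
DST=$(mktemp -d)
echo 'dummy-config' > "$SRC/settings.conf"
# -a preserves mode, ownership, and timestamps, so the rebuilt service
# sees its state exactly as SCALE left it
cp -a "$SRC"/. "$DST"/
cat "$DST/settings.conf"
```

Preserving ownership matters because the jailed services expect the same UIDs the SCALE apps ran under.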

The Failure Event That Finally Produced Useful Evidence

The turning point came when the VM was restarted and the machine finally produced storage errors that were concrete instead of vague.

The first screenshot is the log evidence that made the failure mode explicit: nvme2 reset, outstanding I/O failed, and nda2 detached from the system.

Kernel log showing nvme2 timeouts and the mirror member detaching

Around the time fedora39 restarted, the logs showed:

  • tap0 flapping as the VM interface came up
  • repeated controller resets on nvme2
  • aborted reads and writes on nda2
  • the device detaching from the system

Representative log pattern:

nvme2: Resetting controller due to a timeout and possible hot unplug.
nvme2: failing outstanding i/o
(nda2:nvme2:0:0:1): CAM status: NVME Status Error
(nda2:nvme2:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
nda2: <WD_BLACK SN770 2TB 731100WD 0000000006> ... detached
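A lightweight way to catch this pattern early is to count the reset messages in the kernel log. A sketch, demonstrated here against the captured lines above since there is no live dmesg to read:

```shell
# On the live box:
#   dmesg | grep -Ec 'nvme2: (Resetting controller|failing outstanding)'
# A rising count means the controller path is misbehaving again.
# Demonstrated against the captured log lines:
printf '%s\n' \
  'nvme2: Resetting controller due to a timeout and possible hot unplug.' \
  'nvme2: failing outstanding i/o' \
  | grep -Ec 'nvme2: (Resetting controller|failing outstanding)'
```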

At that point logic-pool degraded because the mirror member backed by /dev/nda2p2 disappeared.

The immediate verification commands after that were the obvious ones:

nvmecontrol devlist
smartctl -a /dev/nvme2
zpool status -P logic-pool
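For scripting those checks, the one-word pool state can be pulled straight out of the status output. A minimal sketch; a captured status fragment stands in for the live command so the parsing is visible:

```shell
# On the live box the input would be:  zpool status logic-pool
# Here a captured fragment stands in; awk picks out the state field:
printf '  pool: logic-pool\n state: DEGRADED\n' \
  | awk '$1 == "state:" {print $2}'
```

Anything other than ONLINE from that extraction is worth an alert.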

After the drive reappeared, the pool came back online — but the system was not clean yet. At that point I brought the missing mirror member back with:

zpool online logic-pool nda2p2

ZFS then repaired and resilvered the mirror. The next screenshot shows that intermediate state: the mirror was back, but the checksum error was still being reported.

Pool back online after the missing mirror member reappeared, with the checksum error still present

Why This Did Not Look Like a Pure ZFS or zvol Bug

It was reasonable to ask whether the VM using a zvol was the root cause.

The hard evidence did not support that theory. What the logs showed was lower-level than ZFS:

  • the NVMe controller path reset repeatedly
  • the drive detached from the OS
  • the pool only degraded because the device path vanished

A zvol-backed VM could absolutely create enough I/O to expose a weak device, weak seating, bad thermal contact, or a bad slot. But a zvol does not explain an NVMe controller vanishing from the bus.

The better interpretation: VM load was the trigger, and the storage path was the actual failure domain.

Diagnostic Step: Swap the Two SN770s

The next useful test was to swap the two WD SN770 drives between slots.

That mattered because it separates two very different problems:

  • a drive problem that follows the module
  • a slot, board, power, or thermal problem that stays with the motherboard position

That swap did not instantly settle the question, but it pushed the investigation toward physical handling and fitment instead of software superstition.

One later step briefly made things worse: after the swap, the suspect drive was not yet seated correctly in the top slot. That is what the degraded-pool screenshot below captures.

Pool degraded after the swapped drive was not seated correctly in the top slot

The Physical Fix

The suspect SN770 was then reseated and given a different heatsink arrangement.

The small NVMe modules were not mounted as cleanly or as flat as they should have been. I skipped the proper standoffs, improvised the fitment, and probably created most of this pain for myself by treating that as a detail instead of part of the system design. The temporary setup still relied on improvised support instead of the final adapter hardware I was waiting on.

I still cannot prove that every symptom came from the mounting alone — but by the end it was hard to ignore the pattern. Subtle mechanical pressure, board flex, or uneven heatsink contact can create exactly the kind of intermittent NVMe nonsense that looks like software right up until the controller starts resetting in plain sight.

Result After Reseat and Heatsink Change

After reseating the suspect drive and replacing the heatsink, the pool recovered.

Final healthy state:

Final healthy state with both mirror members back online and no known data errors

The final cleanup after the pool was stable again:

zpool clear logic-pool
zpool status -v logic-pool

Current Hypothesis and Remaining Unknowns

The evidence points to a failure in the physical NVMe path rather than a software fault.

The strongest indicators are:

  • nvme2 controller repeatedly resetting under load
  • the device (nda2) detaching from the system
  • the problem following the same SN770 module even after the drives were swapped between slots
  • recovery after reseating the module and changing the heatsink arrangement

None of that completely eliminates the motherboard slot, PCIe lane, or power delivery as possible contributors. What it does show is that the instability behaved like a marginal hardware path exposed by heavy I/O — not a bug in ZFS, bhyve, or the VM stack.

Until the replacement drive arrives and the NVMe mounting is rebuilt with proper hardware, the system should still be treated as running on a provisional fix rather than a fully trusted storage configuration.

Filing the RMA

After the later rounds of swapping, reseating, and watching the pool come back online, I stopped treating this as an academic question and filed an RMA for the suspect SN770.

What pushed it over the line was not one dramatic crash by itself. It was the whole chain taken together:

  • the same physical SSD remained the main suspect even after moving it
  • Linux had previously shown the more violent reboot-loop version of the problem
  • FreeBSD showed the calmer but more useful failure mode where the NVMe path reset, detached, and forced ZFS to resilver the mirror when it came back
  • the pool recovered, but only after enough physical handling and reseating that trusting the drive long-term stopped sounding reasonable

That is where this part of the story ends. The drive has an approved RMA, the machine is back online for now, and the next meaningful chapter is what happens after the replacement arrives and the temporary mounting situation is finally cleaned up.

What to Watch Until Then

Until the replacement is installed, the box still needs to survive the same class of load that used to break it.

Things worth watching:

  • VM start and stop cycles
  • Plex-related activity inside the Fedora VM
  • dmesg -w for renewed nvme2 resets
  • zpool status -v logic-pool
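The watch items above can be automated with a cron entry that snapshots both signals to a log file. A config-fragment sketch; the interval and log path are assumptions:

```
# Hypothetical /etc/crontab entry: every 10 minutes, record pool health
# and the current nvme2 reset count
*/10  *  *  *  *  root  (date; zpool status -x logic-pool; dmesg | grep -c 'nvme2: Resetting controller') >> /var/log/nvme-watch.log 2>&1
```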

If the machine stays stable under the same load that used to trigger reboot loops, the physical seating and thermal hypothesis gets stronger. If the problem returns before the replacement arrives, the remaining suspects are still the same familiar ones: the SN770 module itself, the motherboard slot or PCIe path, or power delivery to that slot.

Hardware Cost Perspective

One thing that becomes obvious when revisiting this build is how different the cost would be to recreate it today.

The two NVMe drives used for the mirrored logic-pool were WD_BLACK SN770 2 TB modules that originally cost $121.99 each. The current listing price for the same model is $499.99 per drive, $378 more per unit. Replacing both today would cost $756 more than the original purchase.

The bulk storage pool tells a similar story. The four 18 TB spinning disks were purchased together for $1,477.95 total, roughly $369.49 per drive. The current listing price for the same capacity is $499.99 each, $130.50 more per drive and $522 more for all four.

Rebuilding the storage configuration alone today would cost roughly $1,278 more than what the system originally cost to assemble. That price difference is a useful reminder that a lot of this project depends on opportunistic hardware timing as much as it does on software architecture.
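The arithmetic above is easy to double-check. A quick sketch in integer cents so shell arithmetic stays exact:

```shell
# All prices in cents to keep shell integer arithmetic exact
sn770_then=12199 ; sn770_now=49999
hdd_then=36949  ; hdd_now=49999   # 4 x $369.49 ~= the $1,477.95 total (rounded)
nvme_delta=$(( (sn770_now - sn770_then) * 2 ))  # two mirror members
hdd_delta=$(( (hdd_now - hdd_then) * 4 ))       # four raidz2 members
# Print the three deltas in whole dollars: NVMe, HDD, combined
echo "$(( nvme_delta / 100 )) $(( hdd_delta / 100 )) $(( (nvme_delta + hdd_delta) / 100 ))"
```

That reproduces the $756, $522, and roughly $1,278 figures quoted above.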

Part of the reason for the price difference is the recent surge in AI infrastructure demand. Large-scale training clusters consume enormous amounts of both NVMe flash and high-capacity hard drives, tightening supply and pushing prices upward across the storage market. Enterprise demand from hyperscale data centers tends to gobble inventory first, leaving consumer components like these SN770 modules and large SATA drives subject to the same upward pressure.

The little guy can go fuck himself.

Conclusion

The most useful lesson from this incident is that not every ugly virtualization or storage failure is really a software architecture problem. The instability here had been blamed at various points on ZFS, bhyve, zvol I/O patterns, and the VM stack — and none of those theories held up once the NVMe controller started resetting in plain sight.

Physical fit matters. Improvised mounting, skipped standoffs, and uneven heatsink contact are not cosmetic problems. On a system running storage under sustained I/O load, marginal mechanical contact is a failure waiting for a trigger.

This is where part 1 ends. Part 2 starts when the replacement SN770 arrives.