TrueNAS reboot loop, VM load, and the NVMe that would not stay seated
Overview
This is the build-and-break log for bringing a previously unstable TrueNAS/FreeBSD box back to a usable state after months of ugly crash behavior under VM load.
The short version:
- Salvaged hardware repurposed as a NAS
- A Fedora VM running Plex-related work could trigger a reboot loop
- Early on, the machine was already rebooting itself in a loop — even before the NVMe path failure showed up clearly in logs
- Hard evidence eventually pointed at one NVMe path (nvme2/nda2) timing out and detaching under load
- Reseating the suspect SN770 and replacing its heatsink brought the mirror back online
- Current best hypothesis: physical seating, thermal contact, or mechanical instability — not a software bug
Environment
Storage Configuration
Mirror vdev (data volume):
logic-pool
└── mirror-0
├── nda0p2 (WD_BLACK SN770 2TB)
└── nda2p2 (WD_BLACK SN770 2TB)
Boot pool:
zroot
└── nda1p2 (Crucial P1 1TB)
Bulk storage pool:
spinning-pool
└── raidz2-0
├── ada0p2 (18 TB Seagate SATA)
├── ada1p2 (18 TB Seagate SATA)
├── ada2p2 (18 TB Seagate SATA)
└── ada3p2 (18 TB Seagate SATA)
The original boot pool lived on a USB drive, and the Crucial SSD was used as a ZFS L2ARC cache device.
- Host OS: FreeBSD, replacing a prior TrueNAS SCALE install
- Main pool: logic-pool
- Secondary pool: spinning-pool
- VM workload: fedora39
- Virtualization: bhyve / vm-bhyve
- Root symptom: VM activity — especially Plex-related access or media load — could destabilize the machine badly enough to trigger a reboot loop

Relevant storage devices at the time of investigation:
- nda0: WD_BLACK SN770 2TB (serial 0000000009)
- nda2: WD_BLACK SN770 2TB (serial 0000000006)
- nda1: Crucial P1 1TB (serial 0000000004)
- four 18 TB SATA disks in spinning-pool
Historical Context
This was not a single clean failure, and it was not a single-platform machine either. The box was a rescued workstation-class machine that had already been written off as obsolete and was headed for the trash. Instead of letting it die, I turned it into a NAS and kept evolving it through CORE, SCALE, and eventually bare FreeBSD.
The same box went through three distinct operating phases:
- TrueNAS CORE during the first week of the build
- TrueNAS SCALE once I started chasing GPU support
- bare FreeBSD with the old storage imported and the old app state rebuilt as jails
The hardware matters here because this was never a purpose-built NAS chassis with server parts. The machine was built around:
- a Gigabyte Z390 AORUS ULTRA motherboard
- an Intel Core i9-9900K
- 64 GB (4 x 16 GB) Corsair Vengeance RGB Pro DDR4-3200, part number CMW64GX4M4C3200C16, currently running at 2133 MT/s with XMP disabled
- an EVGA SuperNOVA 850 T2 power supply from the original workstation build
- two 2 TB WD_BLACK SN770 NVMe drives for the mirrored logic-pool
- one 1 TB Crucial P1 SSD that eventually became the boot drive in the FreeBSD rebuild
- an NVIDIA GeForce GTX 1080 that was passed through to the Fedora VM during the SCALE-era Plex setup
- an ASMedia ASM116x PCIe SATA controller that was later removed during the hardware-reduction pass
- four 18 TB Seagate SATA drives making up spinning-pool
There was also a SATA card in the system for part of this story. It turned out not to be the root cause, but it was removed anyway — by that stage I was stripping out anything redundant or suspicious just to reduce the number of moving parts.
The RAM kit is rated for 3200 MT/s, but XMP was disabled from the start, so the machine stayed at the board’s conservative 2133 MT/s the whole time. That mattered later because memory overclocking never had to stay on the suspect list.
The same ZFS datasets survived all three OS phases, and their creation dates still preserve parts of that history. They are useful as platform markers, and they line up closely enough with hardware purchase dates to make the timeline clearer than my memory did. The dataset families that survived each era: iocage from CORE, ix-applications and later ix-apps from SCALE, and zroot plus bastille from the bare FreeBSD rebuild. The command I used to sanity-check that timeline was:
zfs list -o name,creation | sort -k2
The dates told a clean story:
- the two SN770 drives were ordered on January 7, 2024
- logic-pool was created on January 12, 2024
- logic-pool/iocage was created on January 12, 2024 — surviving CORE-era marker on disk
- file activity inside the old iocage jail roots shows the CORE setup was still actively in use through January 19, 2024
- logic-pool/ix-applications appeared on January 19, 2024 — first surviving SCALE-era marker on disk
- logic-pool/ix-apps showed up on November 9, 2024, which lines up with the later SCALE app-stack changes
- logic-pool/bastille was created on March 8, 2026 — start of the FreeBSD jail rebuild
- zroot was created on March 9, 2026 — the current bare FreeBSD install
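One caveat worth flagging about that sanity check: `sort -k2` compares the creation column as plain text, so it keys on the weekday name first, not the actual timestamp. The sketch below demonstrates the pitfall on two hypothetical dataset lines shaped like `zfs list -o name,creation` output:

```shell
# Two hypothetical datasets only five days apart. Textual sort puts the
# younger one first because "Fri" sorts before "Sun" alphabetically.
first=$(printf 'pool/b\tFri Jan 12 10:00 2024\npool/a\tSun Jan  7 10:00 2024\n' \
  | sort -k2 | head -n 1 | cut -f1)
echo "$first"   # pool/b, even though pool/a is the older dataset
```

On the real system, `zfs list -o name,creation -s creation` sidesteps the problem entirely by sorting on the property value rather than its text rendering.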
The earlier CORE screenshot below is useful because it anchors logic-pool to that original phase instead of leaving the timeline purely abstract.

That same early CORE period also shows the VM layout already depending on ZFS volumes inside logic-pool/VMs, which matters because the later Fedora/Plex behavior was not some brand-new SCALE-only invention. The VM storage path was already there from the CORE side.

The actual move from CORE to SCALE was not subtle either. The migration warning below makes the platform transition explicit instead of relying only on dataset timestamps.
Later SCALE screenshots also captured another useful detail: by then the stack was already aware of both the Intel iGPU and the NVIDIA card — GPU-assisted Plex transcoding had become part of how the box was being used.

Once the box was on SCALE and the Fedora VM plus Plex became part of normal use, the machine developed the kind of instability that makes you question every layer at once. VM activity would start, Plex-related load would kick in, and the system would destabilize — sometimes badly enough to reboot-loop or become completely unusable.
That long tail of instability led to several hardware simplification steps before the final NVMe diagnosis:
- the video card was removed weeks earlier
- the SATA card was removed
- the USB drive was removed
- RAM was reseated
- the L2ARC cache device was removed from that role
- the former L2ARC SSD was repurposed as the boot drive
- the storage path was simplified by connecting drives more directly instead of through extra hardware where possible
None of those changes stopped the reboot-loop issue.
Eventually TrueNAS SCALE settled into a more obvious failure mode: instead of coming up normally, it reached GRUB and just sat there. At that point the system was no longer in the category of “annoying but still serviceable.” It was broken enough that rebuilding the platform made more sense than pretending another restart would reveal something new.
Rebuilding on FreeBSD
Once the new FreeBSD install was in place, the existing ZFS pools were imported rather than recreated:
zpool import
zpool import logic-pool
zpool import spinning-pool
The pools were not destroyed. They were re-imported from existing ZFS metadata on disk.
The next problem was reconstructing the application layout that had existed on TrueNAS SCALE — and that part was not rebuilt from memory. It was reconstructed from two sources:
- the SCALE backup database (freenas-v1.db)
- the old ix-applications dataset on logic-pool
The database gave me the system-side clues. The old app dataset still held the per-app state, configs, and databases. That combination made it possible to recover the stack instead of rebuilding every service blind. The practical extraction step was to inspect what the old app dataset still held:
find /logic-pool/ix-applications/releases -maxdepth 3 -type f | head -n 200
From there, the old app state was copied out of the SCALE dataset and rebuilt as BastilleBSD jails on FreeBSD, eventually producing working replacements for the core services again.
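The copy-out itself can be as simple as a tar pipe, which preserves modes and symlinks better than a plain `cp -r`. This is a hedged sketch, not the exact commands used on this box: the paths are throwaway stand-ins created on the fly so the example runs anywhere, where the real source would be a release directory under /logic-pool/ix-applications and the real destination a Bastille jail root.

```shell
# Stand-in directories; on the real system these would be the old SCALE
# release path and the new jail's staging area.
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/config"
echo 'example setting' > "$src/config/app.conf"

# tar pipe: archive from the source, unpack into the destination.
(cd "$src" && tar -cf - .) | (cd "$dst" && tar -xf -)

ls "$dst/config"   # app.conf
```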
The Failure Event That Finally Produced Useful Evidence
The turning point came when the VM was restarted and the machine finally produced storage errors that were concrete instead of vague.
The first screenshot is the log evidence that made the failure mode explicit: nvme2 reset, outstanding I/O failed, and nda2 detached from the system.

Around the time fedora39 restarted, the logs showed:
- tap0 flapping as the VM interface came up
- repeated controller resets on nvme2
- aborted reads and writes on nda2
- the device detaching from the system
Representative log pattern:
nvme2: Resetting controller due to a timeout and possible hot unplug.
nvme2: failing outstanding i/o
(nda2:nvme2:0:0:1): CAM status: NVME Status Error
(nda2:nvme2:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
nda2: <WD_BLACK SN770 2TB 731100WD 0000000006> ... detached
At that point logic-pool degraded because the mirror member backed by /dev/nda2p2 disappeared.
The immediate verification commands after that were the obvious ones:
nvmecontrol devlist
smartctl -a /dev/nvme2
zpool status -P logic-pool
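For scripting the same check, the pool state line is easy to pull out of `zpool status` output with awk. The excerpt below is reconstructed from the degraded window, not a verbatim capture; on the live box the input would be `zpool status logic-pool` itself:

```shell
# Reconstructed (not verbatim) zpool status excerpt from the degraded state.
status='  pool: logic-pool
 state: DEGRADED
config:
        NAME        STATE     READ WRITE CKSUM
        logic-pool  DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            nda0p2  ONLINE       0     0     0
            nda2p2  REMOVED      0     0     0'

# Print just the pool state; anything other than ONLINE is worth paging on.
printf '%s\n' "$status" | awk '$1 == "state:" { print $2 }'   # DEGRADED
```

`zpool status -x` is the even shorter form for monitoring: it prints "all pools are healthy" when there is nothing to report.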
After the drive reappeared, the pool came back online — but the system was not clean yet. At that point I brought the missing mirror member back with:
zpool online logic-pool nda2p2
ZFS then repaired and resilvered the mirror. The next screenshot shows that intermediate state: the mirror was back, but the checksum error was still being reported.
Why This Did Not Look Like a Pure ZFS or zvol Bug
It was reasonable to ask whether the VM using a zvol was the root cause.
The hard evidence did not support that theory. What the logs showed was lower-level than ZFS:
- the NVMe controller path reset repeatedly
- the drive detached from the OS
- the pool only degraded because the device path vanished
A zvol-backed VM could absolutely create enough I/O to expose a weak device, weak seating, bad thermal contact, or a bad slot. But a zvol does not explain an NVMe controller vanishing from the bus.
The better interpretation: VM load was the trigger, and the storage path was the actual failure domain.
Diagnostic Step: Swap the Two SN770s
The next useful test was to swap the two WD SN770 drives between slots.
That mattered because it separates two very different problems:
- a drive problem that follows the module
- a slot, board, power, or thermal problem that stays with the motherboard position
That swap did not instantly settle the question, but it pushed the investigation toward physical handling and fitment instead of software superstition.
A later diagnostic step made things worse again for a moment. After the swap, the suspect drive was not seated correctly in the top slot yet. That is what the degraded-pool screenshot below captures.
The Physical Fix
The suspect SN770 was then reseated and given a different heatsink arrangement.
The small NVMe modules were not mounted as cleanly or as flat as they should have been. I skipped the proper standoffs, improvised the fitment, and probably created most of this pain for myself by treating that as a detail instead of part of the system design. The temporary setup still relied on improvised support instead of the final adapter hardware I was waiting on.
I still cannot prove that every symptom came from the mounting alone — but by the end it was hard to ignore the pattern. Subtle mechanical pressure, board flex, or uneven heatsink contact can create exactly the kind of intermittent NVMe nonsense that looks like software right up until the controller starts resetting in plain sight.
Result After Reseat and Heatsink Change
After reseating the suspect drive and replacing the heatsink, the pool recovered.
Final healthy state:
The final cleanup after the pool was stable again:
zpool clear logic-pool
zpool status -v logic-pool
Current Hypothesis and Remaining Unknowns
The evidence points to a failure in the physical NVMe path rather than a software fault.
The strongest indicators are:
- nvme2 controller repeatedly resetting under load
- the device (nda2) detaching from the system
- the problem following the same SN770 module even after the drives were swapped between slots
- recovery after reseating the module and changing the heatsink arrangement
None of that completely eliminates the motherboard slot, PCIe lane, or power delivery as possible contributors. What it does show is that the instability behaved like a marginal hardware path exposed by heavy I/O — not a bug in ZFS, bhyve, or the VM stack.
Until the replacement drive arrives and the NVMe mounting is rebuilt with proper hardware, the system should still be treated as running on a provisional fix rather than a fully trusted storage configuration.
Filing the RMA
After the later rounds of swapping, reseating, and watching the pool come back online, I stopped treating this as an academic question and filed an RMA for the suspect SN770.
What pushed it over the line was not one dramatic crash by itself. It was the whole chain taken together:
- the same physical SSD remained the main suspect even after moving it
- Linux had previously shown the more violent reboot-loop version of the problem
- FreeBSD showed the calmer but more useful failure mode where the NVMe path reset, detached, and forced ZFS to resilver the mirror when it came back
- the pool recovered, but only after enough physical handling and reseating that trusting the drive long-term stopped sounding reasonable
That is where this part of the story ends. The drive has an approved RMA, the machine is back online for now, and the next meaningful chapter is what happens after the replacement arrives and the temporary mounting situation is finally cleaned up.
What to Watch Until Then
Until the replacement is installed, the box still needs to survive the same class of load that used to break it.
Things worth watching:
- VM start and stop cycles
- Plex-related activity inside the Fedora VM
- dmesg -w for renewed nvme2 resets
- zpool status -v logic-pool
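The dmesg watch can be narrowed so a recurrence stands out immediately instead of scrolling past in boot noise. A sketch of the filter, run here against sample lines taken from the original failure; on the live box the input would be `dmesg -w` (or /var/log/messages) rather than a here-doc:

```shell
# Keep only the controller-reset and detach lines worth alerting on.
hits=$(grep -E 'nvme2: Resetting controller|detached' <<'EOF'
nvme2: Resetting controller due to a timeout and possible hot unplug.
nvme2: failing outstanding i/o
(nda2:nvme2:0:0:1): CAM status: NVME Status Error
nda2: <WD_BLACK SN770 2TB 731100WD 0000000006> ... detached
EOF
)
printf '%s\n' "$hits"   # the reset line and the detach line
```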
If the machine stays stable under the same load that used to trigger reboot loops, the physical seating and thermal hypothesis gets stronger. If the problem returns before the replacement arrives, the remaining suspects are still the same familiar ones: the SN770 module itself, the motherboard slot or PCIe path, or power delivery to that slot.
Hardware Cost Perspective
One thing that becomes obvious when revisiting this build is how different the cost would be to recreate it today.
The two NVMe drives used for the mirrored logic-pool were WD_BLACK SN770 2 TB modules that originally cost $121.99 each. The current listing price for the same model is $499.99 per drive — $378 more per unit. Replacing both today would cost $756 more than the original purchase.
The bulk storage pool tells a similar story. The four 18 TB spinning disks were purchased together for $1,477.95 total, roughly $369.49 per drive. The current listing price for the same capacity is $499.99 each, or $130.50 more per drive — $522 more for all four.
Rebuilding the storage configuration alone today would cost roughly $1,278 more than what the system originally cost to assemble. That price difference is a useful reminder that a lot of this project depends on opportunistic hardware timing as much as it does on software architecture.
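The arithmetic in this section is simple enough to verify mechanically. A quick awk check using only the prices quoted above (the extra cent on the drive total comes from averaging the four-drive bundle price):

```shell
summary=$(awk 'BEGIN {
  nvme_delta = (499.99 - 121.99) * 2        # two SN770s at $378 more each
  hdd_delta  = (499.99 - 1477.95 / 4) * 4   # four 18 TB drives
  printf "nvme: %.2f  hdd: %.2f  total: %.2f", \
         nvme_delta, hdd_delta, nvme_delta + hdd_delta
}')
echo "$summary"   # nvme: 756.00  hdd: 522.01  total: 1278.01
```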
Part of the reason for the price difference is the recent surge in AI infrastructure demand. Large-scale training clusters consume enormous amounts of both NVMe flash and high-capacity hard drives, tightening supply and pushing prices upward across the storage market. Enterprise demand from hyperscale data centers tends to gobble inventory first, leaving consumer components like these SN770 modules and large SATA drives subject to the same upward pressure.
The little guy can go fuck himself.
Conclusion
The most useful lesson from this incident is that not every ugly virtualization or storage failure is really a software architecture problem. The instability here had been blamed at various points on ZFS, bhyve, zvol I/O patterns, and the VM stack — and none of those theories held up once the NVMe controller started resetting in plain sight.
Physical fit matters. Improvised mounting, skipped standoffs, and uneven heatsink contact are not cosmetic problems. On a system running storage under sustained I/O load, marginal mechanical contact is a failure waiting for a trigger.
This is where part 1 ends. Part 2 starts when the replacement SN770 arrives.