SATA drive link layer warnings ("ATA bus error") on boot

jrm · July 29, 2017, 11:18pm

Received my Librem 13 two days ago and just started using it today. It was fine for about five hours, then it just rebooted and started failing to boot. I’m seeing errors like this on boot:

ata3.00: exception Emask 0x10 SAct 0x20000000 SErr 0x280100 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { UnrecovData 10B8B BadCRC }
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/08:e8:30:5f:38/00:00:3a:00:00/40 tag 29 ncq dma 4096 in
         res 40/00:e8:30:5f:38/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }

repeating five more times with different tag numbers
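
For reference, the SErr and SAct values in a log like this can be decoded by hand. The following is a minimal Python sketch, assuming the bit labels that the Linux libata driver prints in its "SError: { ... }" line and assuming SAct is a bitmask with one bit per outstanding NCQ tag. The helper names decode_serr and active_tags are hypothetical and not part of any existing tool.

```python
# Bit position -> label, as printed by libata in the "SError: { ... }" line.
# Assumption: labels taken from the kernel's SError reporting; bits not
# listed here are reserved or not reported.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the labels of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def active_tags(sact):
    """Return the NCQ tag numbers (0-31) whose bits are set in SAct."""
    return [tag for tag in range(32) if sact & (1 << tag)]


if __name__ == "__main__":
    print(decode_serr(0x280100))    # ['UnrecovData', '10B8B', 'BadCRC']
    print(active_tags(0x20000000))  # [29]
```

For the values above, SErr 0x280100 decodes to the same three flags shown in the SError line, and SAct 0x20000000 is bit 29, which matches "tag 29" in the failed command.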

I also see i915 0000:00:02.0: firmware: failed to load i915/skl_dmc_ver1_26 before those errors, but I’m pretty sure that is unrelated (something to do with the Intel graphics).

after these errors, I get this:

WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Volume group "crypt" not found
Cannot process volume group crypt
...
2 logical volume(s) in volume group "crypt" not active
cryptsetup (sda5_crypt): set up successfully

Then it just hangs, probably because it can’t access much of anything on the disks.

jrm · July 30, 2017, 1:12am

After a full power cycle, I got even more libata errors, along with these two errors:

usb 1-3: firmware: failed to load ar3k/AthrBT_0x10020100.dfu (-2)
Bluetooth: Loading patch file failed

And it hangs as before. I can boot into recovery mode.

d2r · July 31, 2017, 2:36am

ata3.00: exception Emask 0x10 SAct 0x20000000 SErr 0x280100 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { UnrecovData 10B8B BadCRC }
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/08:e8:30:5f:38/00:00:3a:00:00/40 tag 29 ncq dma 4096 in
         res 40/00:e8:30:5f:38/00:00:3a:00:00/40 Emask 0x10 (ATA bus error)
ata3.00: status: { DRDY }

For what it’s worth, I see very similar messages on my new Librem 13 v2. I am using a SATA M.2 SSD. I am not sure if this is related to other resume-from-suspend errors. I see these messages on Fedora, Debian, PureOS, and Qubes.

A quick search showed others mentioning bad SATA cables as the likely cause. Since this is a laptop, that does not seem like a good explanation.

EDIT to add: I have not run any benchmarks yet, but otherwise I have not noticed any adverse effect on the laptop’s performance. I suspect that an unrecoverable read here simply leads to a re-read of the data from disk, so I was not too worried about it. Could someone at Purism advise whether this indicates a problem at the hardware level?

kakaroto · August 2, 2017, 7:58pm

Hi,
Yes, those errors are “normal”, and they do not indicate a hardware issue. I’ve been writing a blog post to explain exactly that since June 27th, but I keep getting distracted by other things, then I go back to writing the blog post, then I get distracted again, etc. (and you all know how detailed/verbose I like to be in my blog posts).
But to summarize the issue: there is some undocumented magical setting that needs to be set on the SATA controller to prevent these link errors in SATA communications. I have been unable (so far, but I’m still looking) to find what it is, but I have a workaround for it in coreboot so the issue doesn’t affect grub at least (https://github.com/kakaroto/coreboot/commit/a841e79238aa8a284ac06bda9a1500f6f00fe409).

The thing, though, is that the issue only happens when the SATA drive is configured to work at 6 Gbps. When an error happens, the Linux kernel automatically retries the read. If the retry succeeds, it continues until there’s another error, then it retries that read, and so on. If a retry fails to read the data a second time, the kernel drops the connection speed to 3 Gbps, which completely stops the errors, and SATA works just fine at 3 Gbps. That’s why you sometimes see only 1 error, sometimes 3, sometimes 20, but you’ll always notice that the last ‘ata’ message in dmesg before the errors stopped was about the speed being changed to 3 Gbps.
So anyway, that’s how it is. Hopefully I’ll be able to finish my first draft of the blog post sometime this week; it will then go through a couple of revisions and I’ll post it so you can understand in more detail what exactly is behind this issue. To summarize: it is nothing to be worried about. The errors are recovered by the Linux kernel automatically, and if they are not, the speed drops to 3 Gbps, in which case the errors disappear entirely.
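
To check this behaviour on your own machine, something like the following Python sketch could count the libata exceptions per port and report whether the kernel ended up limiting the link speed. It is only an illustration: the function name summarize is made up, and the patterns assume the kernel log wording "exception Emask" and "limiting SATA link speed to N Gbps", which may differ between kernel versions.

```python
import re
import sys

# Assumed log wording; adjust the patterns if your kernel prints it differently.
EXCEPTION = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
SPEED_LIMIT = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")


def summarize(lines):
    """Map each ATA port to its exception count and last speed limit (or None)."""
    ports = {}
    for line in lines:
        match = EXCEPTION.search(line)
        if match:
            entry = ports.setdefault(
                match.group(1), {"exceptions": 0, "limited_to": None}
            )
            entry["exceptions"] += 1
            continue
        match = SPEED_LIMIT.search(line)
        if match:
            entry = ports.setdefault(
                match.group(1), {"exceptions": 0, "limited_to": None}
            )
            entry["limited_to"] = match.group(2)
    return ports


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for port, info in sorted(summarize(sys.stdin).items()):
        limit = info["limited_to"]
        speed = f"limited to {limit} Gbps" if limit else "no speed limit logged"
        print(f"{port}: {info['exceptions']} exception(s), {speed}")
```

Assuming the script is saved as ata_summary.py (a hypothetical name), it can be fed the kernel log with: dmesg | python3 ata_summary.py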

As for the original post here, it looks to me like it has something to do with the encryption password. I have no idea why your machine rebooted on its own (was the battery low?), but it looks like your install doesn’t find the encrypted partition or can’t decrypt it (wrong password?), or maybe something caused it to reboot at the wrong time and that corrupted your partition. I have no idea, but the “Volume group ‘crypt’ not found” message sounds like it, although it then says it found 2 logical volumes in the group ‘crypt’ and that cryptsetup was set up successfully (which I think means the password is correct, but I’m not much of an expert in that area).
As for the “failed to load firmware” message, you can ignore it. The GPU can have an optional firmware loaded into it that slightly increases performance, but since PureOS is a binary-free OS, it does not come with the proprietary firmware for the graphics card (i915 is the graphics driver). That won’t affect the ability to use the GPU. I think there should be another similar warning about the Bluetooth firmware missing; that one is mandatory for Bluetooth to work, so without it there is no Bluetooth support in PureOS, but it’s irrelevant to your issue as well.

mladen · August 3, 2017, 8:54am

@kakaroto, thank you very much for the explanation!

We had a conversation over email; he simply reinstalled PureOS and it managed to boot successfully.

d2r · August 5, 2017, 2:59am

Thanks @kakaroto, that’s informative.

mpc · May 10, 2018, 7:22pm

I’m still seeing the following:

WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Volume group “[name]” not found
Cannot process volume group [name]

@kakaroto: did you ever write that blog post? Should I be seeing this with the latest coreboot build?

kakaroto · May 11, 2018, 6:05pm

@mpc: I don’t know what that warning is, but it looks like an installation issue, and I suggest you ask about it in a PureOS-tagged forum thread. It’s completely unrelated to coreboot or SATA; it is rather a partition or configuration issue in your OS.
The SATA bus warnings that older coreboot releases had, and which are indeed fixed in the latest build, look more like this:

ata3: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/04:00:d4:82:85/00:00:1f:00:00/40 tag 0 ncq 2048 in
res 40/00:18:d3:82:85/00:00:1f:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }

The blog post that explains it all is here

mladen · June 13, 2018, 7:57pm

@mpc, this should be fixed already, have a look here, please: https://tracker.pureos.net/T433