Ever since I first got my Librem 15 (version 3), I’ve had occasional boot failures. Each time, the problem eventually resolves on its own. I use the larger NVMe drive (/dev/nvme0n1) for /home and the smaller SATA 3 drive (/dev/sda) for everything else, including /boot and the GRUB installation. The SATA 3 drive is partitioned into an unencrypted /boot partition and a LUKS-encrypted partition containing the rest of the root filesystem.
I can always get through the BIOS to the GRUB menu and select a kernel. However, it then hangs indefinitely, either with the GRUB splash screen in the background or on a black screen with the message “Loading initial ramdisk”. When this happens, it never proceeds to loading the kernel (I don’t use the quiet or splash boot parameters, so I would see all the kernel boot messages), and I never experience hangs after the kernel loads.
Things I’ve tried:
setting GRUB_CMDLINE_LINUX_DEFAULT="nvme_load=YES" in /etc/default/grub
searching journalctl and dmesg for useful error messages
loading the same kernel from GRUB manually rather than using the default kernel
searching message boards and Stack Exchange for solutions
Any ideas on how to troubleshoot and fix this? I noticed that some ASUS machines with Intel CPUs have had problems loading microcode updates (see this bug). Although that just seems to affect ASUS boxes, I need to see if disabling microcode updates by adding the dis_ucode_ldr boot parameter fixes the problem — (un?)fortunately this current kernel fixed itself and so I can’t reproduce it right now.
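For reference, here is a minimal sketch of how the dis_ucode_ldr test could be set up persistently on a Debian-style system. The helper name add_kernel_param is made up for illustration, and it assumes the file contains a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line. It takes the file path as an argument so it can be tried on a copy before touching /etc/default/grub.

```shell
#!/bin/sh
# Hypothetical helper (not part of GRUB): append a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT in the given file unless it is already there.
# Assumes one line of the form GRUB_CMDLINE_LINUX_DEFAULT="..."
add_kernel_param() {
    file=$1
    param=$2
    # Nothing to do if the parameter is already on the line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"(.* )?${param}( .*)?\"" "$file"; then
        return 0
    fi
    # Keep a backup, then insert the parameter before the closing quote.
    # The second expression removes the leading space left when the
    # original value was empty.
    cp "$file" "$file.bak"
    sed -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\")([^\"]*)\"|\1\2 ${param}\"|; s|^(GRUB_CMDLINE_LINUX_DEFAULT=\") |\1|" "$file.bak" > "$file"
}

# Possible usage, as root:
#   add_kernel_param /etc/default/grub dis_ucode_ldr
#   update-grub
```

For a one-off test that changes nothing on disk, the same parameter can instead be appended to the linux line after pressing e at the GRUB menu.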
System information:
Hardware: Librem 15 (version 3), default 120 GB SATA 3 SSD with aftermarket 1 TB Samsung 960 EVO NVMe SSD
Granted that you have gone to some trouble to set it up like this and granted that it should work, but wouldn’t it be better to boot from the much faster NVMe drive?
Even if you don’t like that arrangement, it may be a valid test for fault isolation purposes. However, the fact that the problem can’t currently be reproduced may make it difficult to do any troubleshooting.
You should at least be able to use GRUB to boot the one previous kernel version. Does that still exhibit the problem?
(It’s possible that an actual kernel bugfix was backported and so you won’t see the problem again with the current 18.04 LTS kernel or near term future updates to it.)
NVMe is faster, so it makes more sense to put /boot on it and leave /home as a separate partition on an external SATA 3 drive.
Access times on NVMe SSDs are lower than on SATA SSDs, so the lower read/write latency is better spent on OS-related activity during boot and everything that follows. SATA is an old protocol and is already very close to saturating its bandwidth.
Due to the wide disparity in disk size, that may not be acceptable or even possible, although it may be possible to split /home between the two drives if there are multiple users and that works out.
Adding: Or did you mean to buy a third disk? The 1TB NVMe disk would have cost a reasonable amount. I think I would want to use it.
I’d disagree. My old laptop boots in 30 seconds off a SATA drive, from power-on to the gdm login prompt.
The new one with NVMe takes 10 seconds.
Both times are short enough to not make any difference for me.
In general, Debian and PureOS both boot fast, and once the system has booted, disk activity is next to none. So give /home either the fastest disk or the largest, which may be the same one.
I can confidently say that apart from this occasional issue, this setup works well and my bootup times are acceptable (<30 seconds). I thought hard about using NVMe for /boot and GRUB, but I wanted to separate out the /home drive because it’s easier to manage that way from a backup and upgrade standpoint. At the time I set this up, I had also read that GRUB and Linux in general sometimes struggle to boot directly from NVMe, although that might no longer be true (if it ever was). It would be a significant amount of work, but I could potentially change this if we can demonstrate that there’s no other workaround. Overall, though, I’m pretty happy with the setup.
@mladen I will get back to you ASAP — I’m at work right now but I can let you know when I get back home. I used the script MrChromebox uploaded here in August 2019 in case that information is helpful in the meantime.
@kieran That’s a good point. I’ll go through old kernels and see if I can reproduce it. I think that at some point GRUB or something else resolves it, but it happens regularly enough that I’m sure it will come back. I’ll try tonight! Unfortunately once it resolves it seems to stop happening for that kernel version. However, I might be able to reinstall a kernel that I had a problem with before and see if I can make it happen that way.
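As a starting point for going through old kernels, here is a small sketch that lists the kernel versions still installed and whether each has a non-empty initrd next to it. The function name list_kernels is made up for illustration, and it assumes the Debian-style file names vmlinuz-<version> and initrd.img-<version> in /boot.

```shell
#!/bin/sh
# Hypothetical helper: list kernel versions found in a boot directory
# (default /boot), oldest first, and note whether each has an initrd.
# Assumes Debian-style names: vmlinuz-<version> and initrd.img-<version>.
list_kernels() {
    bootdir=${1:-/boot}
    for image in "$bootdir"/vmlinuz-*; do
        [ -e "$image" ] || continue        # glob matched nothing
        version=${image##*/vmlinuz-}       # strip directory and prefix
        if [ -s "$bootdir/initrd.img-$version" ]; then
            echo "$version initrd-present"
        else
            echo "$version initrd-missing"
        fi
    done | sort -V                         # version sort (GNU coreutils)
}

# Possible usage:
#   list_kernels          # inspects /boot
#   list_kernels /mnt/boot
```

Any version listed here should also be selectable from the GRUB menu, so the list doubles as a checklist of kernels to try repeatedly.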
Any other thoughts on how to go about the troubleshooting process? At the moment I’m not sure what the next steps should be. I’ll definitely provide the information above as soon as I can.
It almost certainly was true at one point in the past: a completely new disk technology will flummox a BIOS that has no code to support it! However, it hasn’t been true in general for a few years. The oldest computer that I actually boot from NVMe was purchased in 2016.
Obviously any computer where it is a valid configuration to have a single disk drive that is an NVMe drive must be able to boot from it.
However, that doesn’t tell you when NVMe support specifically became available in coreboot.
Anyway let’s leave that and accept that it is set up the way it is set up.
@mladen Updated coreboot; problem unfortunately still occurs. I fear flashing: it terrifyingly failed to get to the BIOS the first two times I started the computer afterwards, and now the initial splash screen is blurry, off center, and with a large black box to the right. Hopefully I’ll figure that out later.
I noticed while testing that the same kernel versions which cause indefinite hangs sometimes inexplicably succeed without any problem. So I do have kernels I can test with, since repeated boots seem to eventually trigger the event.