Buying Purism devices without DRAM memory or SSD storage

I have a simple request: when offering upgradeable laptops and desktops in the future, please offer options for no memory and/or no storage. Lately, data center demand for the chips at the heart of these devices has exploded, to the point where even ancient DDR4 is being kept alive by manufacturers for more than another year. This has, of course, also translated into higher prices, in a sort of reverse Moore’s Law. Plenty of somewhat dated but perfectly useful chips are already in the landfill. Under these circumstances, it would make more sense for some of us to harvest our existing ones from defunct devices, allowing Purism to price their offerings more competitively for users who don’t care about absolute maximum performance. We can always upgrade once Moore’s Law resumes.

2 Likes

I have a feeling that this is painful from a QA point of view i.e. RAM and disk have to be installed in order to confirm successful boot and to pass whatever QA tests are run … and then RAM and disk have to be removed. Even so, it’s certainly a reasonable request.

3 Likes

That did occur to me after I posted it, but I figured that they would simply rely on network boot in such cases. If that’s not feasible, though, it’s still likely worthwhile from a competitiveness standpoint, in the sense that only a few extra arm movements are required to pull $100+ worth of components back out, which can then be reused to test some other machine (with prudence about viruses, obviously).

The other advantage to this approach is that it would (economically) enable (1) installing higher-capacity or higher-performance memory or storage than would otherwise be available, and (2) the use of ECC memory, which is advisable given the mountain of research papers showing bitwise error induction via specific repeated write patterns, provided of course that the firmware support is present to take advantage of it. (Yes, I realize that virtually nobody actually ships laptops with ECC, probably because the market is simply ignorant of the incipient vulnerabilities, even if only with respect to data corruption.)

1 Like

Network boot still needs RAM!!!

I think it’s also good to test the disk I/O path.

Not all chipsets support ECC at all. (So, for example, the Intel i7-10710U that is in the Librem 14 simply does not support ECC. There is nothing that Purism can do about that. It is not a question of firmware support, although possibly it would need support in the boot firmware as well.)

I agree with you though that ECC is good. I think that the need for ECC is becoming greater as capacities are increasing.

Likewise the CPU / chipset places some limits on what capacity and performance memory can be used.

You can check all three of those attributes of the memory specification on the Intel web site, e.g. https://www.intel.com/content/www/us/en/products/sku/196448/intel-core-i710710u-processor-12m-cache-up-to-4-70-ghz/specifications.html

Of course you’re right that network boot needs RAM if you’re intending to boot an OS. But you could do fairly comprehensive (and arguably more efficient, targeted, and fast) diagnostics without it (yes, even packet DMA from the network device, thanks to cache snooping). This is how modern firmware has booted (though not necessarily over the network) for many years: it uses the cache hierarchy as RAM by respecting the collision constraints implied by cache line tags. Literally just being able to run through a complicated hash function and get the expected output proves a lot about circuit competency. So the more I think about it, you could test a lot in firmware alone, including DRAM and SSD if they happen to be present, without relying on the complexities and vulnerabilities of the network at all. Might be worth the effort to rig up some test functions, as it would provide a very airtight and surgical quality assurance setup in the factory, where all you need to do is flash the firmware and reboot.

But… you might still want to flip a DRAM module into and back out of the machine, and boot an actual OS.

Ideally, yes. Then again, how many machines have a bad disk I/O path but nothing wrong anywhere else that the above approach would detect? And would you really know if the path were, say, thermally unstable? All I’m saying is that, at a certain point, errors are so rare that it’s more efficient to just ship the machine and hope for the best. It’s a hard economic optimization problem, above all because you don’t have good data on expected failure rates vs. the cost of allowing some obscure problems to sneak out the door. Not saying I have a good answer. It’s a deep rabbit hole.

That being said, ECC is all about prevention (and detection when correction isn’t possible, i.e. uncorrectable multi-bit errors). Much clearer economic tradeoff than “how much testing is enough”. I’m really shocked that some Intel chipsets still don’t support ECC, so thanks for pointing that out. In the old days, ECC protected you against extremely rare memory cell flips due to high-energy particles. Then, when DRAM lines got noisier due to frequency scaling, it protected you from (maliciously) induced interference between proximate rows, which is much more probable than particle collisions. But now we’re close to the point where quantum noise matters (from the L1 cache to, eventually, the DRAM modules), which means your memory won’t survive completely intact no matter how shielded and secure your system happens to be. Not having ECC support (for cache as well as DRAM) in hardware is just insane. Hopefully the team will consider this in future CPU selection.

And for that matter, the most insidious thing about a lack of ECC support is that it would preferentially result in silent corruption, simply because most errors wouldn’t trigger a system fault. You could lose file integrity in the middle of a copy, and then not notice until, say, your video refuses to play years later. For now, all we can do is hash file trees before and after copying. (And you need to reboot the target machine before comparison in order to flush everything to the media. And the source machine too, so that the same files are loaded into different DRAM cells.) Not that this is the only way a lack of ECC support can mess things up.
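
To make that concrete, here’s a minimal sketch of the “hash the tree before and after” idea (untested; the paths and the manifest location are just examples, and the manifest has to be carried over to the target machine somehow, e.g. with scp):

```
# On the source machine, before copying:
cd /srv/source
find . -type f -print0 | sort -z | xargs -0 sha256sum > /tmp/manifest.sha256

# Copy the tree and the manifest across, reboot both machines as described
# above, then on the target:
cd /mnt/target
sha256sum -c /tmp/manifest.sha256
```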

I think you have a rational overview of the problem. It’s just hard to know the right amount of effort and manufacturing complexity within economic constraints. Hopefully Purism will be shipping enough units that all these questions get closer scrutiny.

1 Like

(Or hopefully Intel will consider this and support ECC in a wider variety of chipsets.)

Note that requiring ECC support could currently have the effect opposite to “allowing Purism to price their offerings more competitively”, because it could bias the choice toward more expensive Intel chipsets (approximately speaking) and, putting cost aside, ECC-capable parts tend to be spec’d for desktop rather than laptop use. (On the Intel web site you can filter for processors with ECC support, but I didn’t notice any current CPUs in the laptop segment. There are some in the “embedded” segment.)

Historically Intel only supported ECC in their Xeon platforms: Xeon CPUs and motherboards designed for workstations.

AIUI, until recently, none of the motherboard chipsets for the Intel Core i[3,5,7,9] CPUs (and certainly not the entry-level N-series CPUs) supported ECC. And while the standard Z/B/H series chipsets still don’t support ECC … some of the W series (Workstation, e.g. W680) chipsets/motherboards support Core i[3,5,7,9] and do support ECC (starting in 2022).

I’m not 100% sure about this, but I believe that if you use “rsync --checksum”, the file is read once for the checksum, read once more for the transfer, and the checksum is verified while writing … to make sure it is transferred correctly. I don’t think rsync checksums the written file … but if you do two “rsync --checksum” passes, it will redo the transfer only if the source/destination checksums differ. In any case, if you have done a copy … you can verify and fix it with an “rsync --checksum”.
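
In shell terms, the idea would be something like this (sketch only; the host and paths are made up):

```
# First pass: copy, with whole-file checksums computed during transfer.
rsync -a --checksum /srv/source/ backuphost:/srv/target/

# Second pass: re-read both sides; anything whose checksums differ is
# listed and re-transferred.
rsync -a --checksum --itemize-changes /srv/source/ backuphost:/srv/target/
```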

As I said, you can jump straight onto the Intel web site, filter by ECC support and bring up a list of all the CPUs that do. There are more than there used to be (which, as you say, would at one time have been limited to the Xeon range).

That’s very much a point solution for one small failure possibility though. (Judicious use of sha256sum -c ... or similar could pick up file copy corruptions. When I am archiving Linux distro .iso files I do actually do this since the hash file already exists.)

Random memory corruption is going to be difficult to deal with in a general way without ECC.

It is vaguely possible that the normal TCP/UDP checksum could detect memory corruptions in network communication provided of course that you aren’t using offload (TOE).
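
For what it’s worth, you can at least see (and, if you want the checksums computed by the CPU out of main memory, turn off) checksum offload with ethtool; the interface name here is just an example:

```
# Show current offload settings for the interface
ethtool -k eth0 | grep checksumming

# Disable RX/TX checksum offload so checksums are computed/verified by the CPU
ethtool -K eth0 rx off tx off
```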

And if you happen to be using LUKS with authentication (NB: marked EXPERIMENTAL) then some memory corruption will be detected when the disk block is read back.
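
If anyone wants to experiment, the authenticated mode is selected at format time. A rough sketch only; cryptsetup itself warns that this is experimental, and the device path and cipher/integrity combination here are illustrative, so check the current man page:

```
cryptsetup luksFormat --type luks2 \
    --cipher aes-xts-plain64 --integrity hmac-sha256 /dev/sdX
```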

And with some file system types some memory corruption would be detected by the checksumming in the file system.

And any file that is digitally signed and stored that way (e.g. email) and corrupted by dodgy memory at any point should be detected.

Not to mention that some memory corruption will just crash or hang your system … and you should detect that. :wink:

But let’s say that the logic in userspace application code is “if variable then do_one_thing else do_another_thing”, that variable is stored in memory at least at one point in its lifetime, and the variable is thereby corrupted in such a way that the wrong path is followed …

My point is that while the newer Intel Core CPUs do support ECC, that’s not enough. It’s not just about CPUs. You need both the CPU and the motherboard chipset to support ECC. For example, https://www.intel.com/content/www/us/en/products/sku/236773/intel-core-i9-processor-14900k-36m-cache-up-to-6-00-ghz/specifications.html supports ECC, but most of the motherboards you buy for it will not, e.g. the Z790-based motherboards don’t. You’ll need to get a W-series (e.g. W680) based motherboard.

i.e. Xeon and “motherboards for Xeon” support ECC: that was their main raison d’être. But we’ve moved to where a CPU might support ECC, but most of the motherboard chipsets don’t.

Sure, but the previous poster was talking about one case and I was trying to helpfully point out that one could do copies in a way (“rsync --checksum” twice) that avoids that case. It’s certainly just as time-consuming (CPU-wise) as doing what the previous poster said … but it’s completely built into the process.

Most of those are designed for “bitrot” detection rather than memory corruption detection.

Sure. None of the various incidental technologies that I mention are designed specifically to detect erroneous results arising out of ‘bad’ local memory. It is only by good luck if one of those technologies happens to be affected by bad local memory in such a way that it leads to detection. (And that also applies to rsync.)

That’s why you need ECC.

The only other legitimate widely available technology for this purpose is Power-On Self Test (POST), which can include verifying some or all of main memory. (That isn’t perfect as it is a one-off, and it is relatively expensive in boot time.)

And GRUB or other similar bootloader may include a memory test option (but that must be explicitly chosen by the user, usually after the user suspects there is a problem, which in turn may be after it is too late).

It is surprising that cp itself hasn’t grown some kind of verification option (whether that be by reading everything back to compare byte for byte equality or by using a hash). But if you need that then a wrapper shell script is your friend.
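
Something like this, as a very rough sketch (regular files only, which is exactly why it has to be opt-in, as noted below):

```
#!/bin/sh
# cp-verify SRC DST : copy one regular file, then re-read both sides and
# compare hashes. Note: without evicting the file cache first, the re-read
# may just hit the same (possibly corrupt) cached pages, as discussed
# elsewhere in this thread.
set -e
src="$1"
dst="$2"
cp -- "$src" "$dst"
h1=$(sha256sum < "$src")
h2=$(sha256sum < "$dst")
[ "$h1" = "$h2" ] || { echo "verify failed: $src -> $dst" >&2; exit 1; }
```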

Any verify-after-copy can fail horribly if the source or destination is or includes a special file, e.g. /dev/random (as an extreme example), so it does need to be an option.

This is an actual serious problem. Like, what are we supposed to do? Stick to DDR4 so the bit cells are fatter? But then maybe they’re actually more susceptible than smaller-geometry DDR5 chips because they were designed with less foreknowledge of maliciously induced corruption. I don’t know, but maybe the only solution is to push ECC into the next Librem Mini with an actual desktop chip. At least that would be progress. Or have a high-power spin of the laptops for users who prioritize reliability over portability.

That’s probably helpful but when something is read twice, the OS will often be able to recall it from the file cache in main memory, so if that’s corrupt then you’ll read the same corrupt DRAM line twice. Hence my recommendation to reboot both machines.

Exactly. This is the whole idea with malicious row corruption due to unprivileged but deliberately constructed writes from a user process (generally involving at least 2 threads). In theory you might be able to get the kernel to do something unauthorized. If you were to attack a large number of machines, you might be able to get one of them to bend to your will, even perhaps indirectly by inducing dangerous row interactions from the network stack on the target machine without actually pwning it, which could perhaps be enough to gain a foothold on the entire network. Empirically, this hasn’t been very successful to date but better just to sidestep the entire debate by employing ECC where possible. Moreover, you never know when a kernel change or DRAM architectural change might occur which suddenly makes the attack more feasible. The lack of laptop support suggests to me that the market is still overwhelmingly ignorant of these risks.

Good point. This is why it’s been eliminated from modern firmware: it wastes more productive human time than it saves, especially because people tend to reboot rather often. At most, it should be a firmware setup option. Even then, it’s highly unlikely to detect modern DRAM failures, which usually require constructed row conflicts between different cores, ideally under higher temperatures than would likely be present during POST.

I would speculate that this goes back to Linus’ frustration over some kernel option that proposed to allow bypassing the cache hierarchy during reads or writes. I don’t recall the details, but the idea was that you could tell the kernel to read or write directly between the media and main memory. The problem, of course, is that it’s caches all the way down; Linus’ point was that the kernel has no business knowing where all the DRAM is in the entire chain, starting with the memory cells on the storage unit itself, which might as well be on the other side of the world. So the proposal was killed and I suspect it never came back from the dead, hence the lack of “--verify” in cp; it wouldn’t do what you’d want it to do because it doesn’t know where the bits are ultimately stored.

I believe it is possible to read bypassing the file cache / to read asking that the file not be cached / to evict the file from the file cache before making the second pass.
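
The blunt instrument, if you just want to be sure the second read really comes from the device rather than the page cache, is to drop the caches (system-wide and heavy-handed; needs root):

```
sync
echo 3 > /proc/sys/vm/drop_caches
```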

Another risk is that if memory is actually faulty (as distinct from an intentionally induced temporary fault) then even actually reading the file twice may just get you the same wrong in-memory contents twice i.e. a consistent bit error. That’s where LUKS with authentication should save you.

Totally. POST is for detecting actual faults, not attacks.

I think whether POST is too slow if it checks main memory depends on the scenario (frequency of reboot) and user priorities. Servers and phones might in theory only get rebooted on operating system upgrade i.e. infrequent enough that slow POST is tolerable. Desktops and some laptops might use standby in preference to shutting down.

I think the actual fix for this kind of attack is to render this kind of attack impossible - rather than using ECC to (attempt to) detect when memory gets corrupted. But ECC is still a good idea.

Indeed. You need to separate out “Purism devices” that never run on battery and those that sometimes do and those that basically always do - because the compromises that you might want to make are different.

Indeed. Cache inside the storage unit. File Cache. Various levels of CPU caches in front of main memory.

That’s not quite true. A cp that verified by re-reading the source and reading the destination, and didn’t even bother to bypass the file cache, could still detect some memory errors, e.g. maybe it was the copy written to the destination that was corrupted. (As the source and destination are assumed to be distinct files, in the file cache they ought to be stored separately and independently. Or if the cache technology is such that one file happens to evict the other, then that’s OK too, because then one file will have to be re-read from underlying storage.)

1 Like

I wouldn’t trust that cp or whatever app actually knows how to ensure this. If it matters, just reboot both machines before you compare.

Fortunately, I think this is unlikely to the point of being effectively impossible. The reason is memory mapping randomization (which was implemented for security, not actually error detection) on top of timing randomness.

I’d stick with the latter. Attempting to detect memory corruption is really expensive. Either you do it sparsely, in which case you can only detect that a given machine is having chronic issues and may have already trashed your data, or you do it comprehensively, taking hashes all over the place, which is prohibitive. I don’t think you can render row-based attacks impossible. We don’t even know all the ways they can happen; the behaviour is just some fuzzy statistical transformation of the physical geometry, which in turn depends on brand, SKU, temperature, frequency, voltage, etc. The problem is that it generally doesn’t involve anything privileged, just more than one thread.

More importantly, they differ by customer, not only device. Given the sort of customer buying from Purism, battery life and price are generally secondary to security and stability. But of course there are limits. I wonder if ARM has a mobile ECC chip, for that matter. I don’t think anyone could stomach having an Intel desktop chip in a cell phone. Not even me.

Indeed. I’m not against cp having a halfway-competent verification function. That might at least tell you if you’re having chronic failures. But if it mattered, I would just reboot to force a flush and reread.

True but I am talking about cp growing a verify option - so cp remains running without much change in memory layout between the original copy pass and the verify pass.

Probably not but I think dd has the right options, based on reading the man page, not based on any actual testing.
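
Something along these lines, per the man page (untested; “somefile” is a placeholder): iflag=direct uses O_DIRECT, which comes with alignment constraints, while iflag=nocache just asks the kernel to drop the cached pages for the file.

```
dd if=somefile iflag=direct bs=1M status=none | sha256sum
dd if=somefile iflag=nocache bs=1M status=none | sha256sum
```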

1 Like