I must say I’m encouraged by the flow of novel ideas here, some of which may well be viable. Even the CPU vendors will listen to reason, given an overwhelmingly convincing proof of their incompetence and a map out of hell.
First of all, the reason I advocate killing speculation entirely is: (1) it will kill Spectre; (2) while, yes, all cores will slow down, most day-to-day workstation problems become bounded by frontside-bus speed (or IO speed) when run at sufficient scale, so we have more and more time to spare as the core count increases; (3) as @vmedea pointed out, it would free up silicon real estate; and (4) doing so does not preclude an asymmetric architecture with a few brawny cores and lots of wimpy ones, since we could, for example, simply boost the cache sizes of the former to great effect.
However, I’m also sober about the CPU vendors’ assumption that, even when security matters, it’s likely to be overridden by the dodo birds in purchasing who want to impress the boss with the highest-performance tools. In game-theory terms, this is a tragedy of the commons: everyone benefits if everyone does the right thing (kill speculation, add cores, and let Moore’s Law wash our tears away); but the bad guys win if only one good guy disables it, because buyers are generally ignorant of the tradeoff; so in the end it’s likely that no one disables it. At best, as @kieran suggested, we might get a kill switch (which would require all sorts of yucky interprocessor synchronization, but might just work). Given the choice between that and nothing, I’ll take it. I will pay more for a CPU that can disable this clowning, even if the OS hasn’t quite figured out how to do it properly yet. (No doubt Intel is hanging on my every word!)
I do like @Fan’s suggestion that we reboot the entire industry (yes, it might take that) from secure primitives all the way back up to full stacks, but I don’t see it happening, because laziness is the way of the world. At best, perhaps we could force timing contracts on apps, e.g. “The OS agrees to tell you the wall-clock time if you agree that you will be terminated before the next task switch occurs.” But the devil is in the details. I for one don’t see how to provide real timing information (even local timing) to an app without risking that it learns something about neighboring processes. Even the idea of constant timeslices doesn’t work, because I can still count how many memory accesses I can complete within such a slice, and thereby glean the same information.
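To make that concrete, here’s a minimal sketch (mine, purely illustrative, POSIX threads assumed) of the classic workaround: if the OS won’t hand you a fine-grained clock, spin up a counting thread and use it as one. Every name and number here is made up for illustration.

```c
#include <inttypes.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Shared counter incremented as fast as possible by a helper thread.
 * Reading it before and after an operation gives a crude clock even
 * if the OS refuses to expose fine-grained wall time. */
static atomic_uint_fast64_t ticks;

static void *counter_thread(void *arg)
{
    (void)arg;
    for (;;)
        atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    static volatile uint8_t probe[64];  /* the memory whose latency we "time" */
    pthread_t t;

    pthread_create(&t, NULL, counter_thread, NULL);

    uint_fast64_t start = atomic_load_explicit(&ticks, memory_order_relaxed);
    (void)probe[0];                     /* the access under measurement */
    uint_fast64_t elapsed =
        atomic_load_explicit(&ticks, memory_order_relaxed) - start;

    printf("elapsed ticks: %" PRIuFAST64 "\n", elapsed);
    return 0;
}
```

The point is simply that any concurrently running code is itself a timer, so clamping the clock APIs mostly inconveniences legitimate apps.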
But let’s rewind a bit, because it’s important to remember what Spectre is: a family of security vulnerabilities which exploit timing information to leak data. The only difference between one Spectre sibling and another is the manner in which it induces that information to leak, and across which particular wall it seeks to leak it. The archetypal exploit goes like this: (1) using conditional branches at locally controlled addresses, prime the branch target buffer (BTB) so that the code on the other side of the wall will mispredict a branch (which is possible if the attacker knows how local addresses map to BTB entries, a mapping that is in general dangerously simple, so even address space layout randomization (ASLR) doesn’t help much), causing it to use a sensitive piece of data to generate a speculative memory address; (2) after many iterations of this experiment, and using timing information provided by the OS, determine which lines have been (un)loaded to or from the cache as a result of that misprediction, simply by measuring memory latency to the relevant addresses; (3) work backward from the cache tags corresponding to the affected lines (which are deterministically, if weakly pseudorandomly, derived from the target addresses) to determine what sensitive data had been misinterpreted as an address; and (4) repeat until sufficient data has been extracted, such as all the bytes of a password. This is best done with fine-grained timing measurements, but over longer periods of time, perhaps spread across multiple sessions, no amount of timer jitter is enough to hide the signal; even Javascript will suffice. Unfortunately, a lot of effort has been put into decreasing timing resolution in order to thwart Spectre. This slightly raises the bar but cripples a lot of productive apps which require finer timing. It’s a mistake that should be reversed.
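For the curious, here’s a stripped-down sketch of step (2) only, the flush-and-measure probe, with a plain load standing in for the victim’s misspeculated access. It assumes an x86-64 machine (clflush and rdtscp), the latency threshold is machine-dependent, and it deliberately omits the branch-training of step (1); it’s illustrative, not a working exploit.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>               /* _mm_clflush, __rdtscp */

#define STRIDE 4096                  /* one page per possible byte value */
static uint8_t probe[256 * STRIDE];

/* Time a single load; a cached line comes back much faster. */
static uint64_t time_load(volatile uint8_t *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

int main(void)
{
    /* 1. Flush every candidate line so they all start uncached. */
    for (int v = 0; v < 256; v++)
        _mm_clflush(&probe[v * STRIDE]);

    /* 2. Stand-in for the victim: a (mis)speculated access indexed by a
     *    secret byte would pull exactly one of these lines into the cache. */
    int secret = 0x42;               /* hypothetical leaked value */
    *(volatile uint8_t *)&probe[secret * STRIDE];

    /* 3. Measure each candidate; the cached one stands out as fast. */
    for (int v = 0; v < 256; v++)
        if (time_load(&probe[v * STRIDE]) < 120)
            printf("likely secret byte: 0x%02x\n", v);

    return 0;
}
```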
The easiest mitigation would be to prevent the BTB from being primed in any useful way. One popular “solution” is to reset the BTB state when switching tasks, so that all priming will have been forgotten upon arrival at the other side of the wall. When the original task is resumed later, the BTB will need to retrain, which costs only a small performance penalty. Unfortunately, this doesn’t work either, because the attacker can simply choose those branch locations in the target binary which will be primed the wrong way by virtue of the BTB reset itself. (With no BTB history for a given branch instruction, speculation should be disabled the first time that branch is encountered. But AFAIK it’s just set to speculate (un)taken – whatever the default happens to be.) If making the first encounter of a branch nonspeculative would imply too many silicon changes, then a better approach would be to randomize (un)taken on that first iteration. So, when switching tasks, set all the BTB predictions pseudorandomly (in a trapdoor way which is not predictable to an observer at any privilege level). That way, an attacker couldn’t easily determine whether specific cache lines had been (un)loaded due to bona fide accesses or misused data. (This isn’t quite solid, which I’ll get to below.)
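To illustrate what I mean by a trapdoor randomization of the cold prediction, here’s a toy model in C. In reality this would live in silicon; the mixing function, constants, and names are mine and purely illustrative, and a real design would need a keyed function that an observer can’t invert even with many samples.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t task_secret;      /* re-drawn from a hardware RNG on task switch */

void on_task_switch(uint64_t fresh_entropy)
{
    task_secret = fresh_entropy;  /* old predictions are now meaningless */
}

/* splitmix64-style mixer, standing in for a proper keyed PRF. */
static uint64_t mix(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* First-encounter (cold) prediction: taken or not-taken, decided by a
 * secret-keyed hash of the branch address rather than a fixed default. */
bool cold_prediction(uint64_t branch_address)
{
    return mix(branch_address ^ task_secret) & 1;
}
```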
Another approach would be to create a per-task random mapping of memory-address hashes to cache lines – in other words, a per-task cache-tag scheme. This is a mess to implement, though, as it would require memory accesses in order to redirect other memory accesses, rather like a translation lookaside buffer (TLB) lookup – but that might create yet other Spectre vulnerabilities, so we would be stuck with an unwieldy register implementation. It’s theoretically possible, but much harder than the BTB route, and its only advantage is a lack of BTB disturbance.
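As a similar toy model, the per-task cache-index idea might look like this: the set index is no longer the usual low address bits but a keyed hash of the line address, with the key rotated per task, so one task’s priming of a set says nothing about where another task’s lines land. Again, the hash and constants are placeholders.

```c
#include <stdint.h>

#define NUM_SETS 1024U                 /* power of two, for the mask below */

static uint64_t cache_key;             /* per-task secret, set on task switch */

static uint64_t scramble(uint64_t x)   /* 64-bit finalizer-style mixing */
{
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

/* Map a physical address to a cache set via the keyed hash instead of
 * the plain address bits. */
unsigned set_index(uint64_t phys_addr)
{
    uint64_t line = phys_addr >> 6;    /* drop the 64-byte line offset */
    return (unsigned)(scramble(line ^ cache_key) & (NUM_SETS - 1));
}
```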
The BTB randomization approach sets a very high bar, but perhaps the code under attack has multiple potentially leaky branches, in which case the attacker might just have to look at a larger number of cache lines over more wall crossings in order to ascertain the sensitive data. This is why I think the page approach is the most robust, short of a speculation kill switch: it disallows cache-line (un)loading until the CPU knows what all preceding branches should actually have done. For that matter, speculation isn’t actually disabled; only memory accesses are, so register computations, including floating-point, could still happen speculatively, which still helps performance. It does, however, prevent data from being misused for memory access, so the cache state will provide no information after the fact, provided that the attacked code follows a data-agnostic execution path, as any competent crypto library surely does. (This means you don’t touch matrix row 0 if the next key bit is 0 and row 1 if it’s 1; instead, you touch both of them and just nullify one after loading into registers, so that total execution time is essentially constant and the memory access pattern is always the same, assuming no speculation.) Speculation could safely be re-enabled only for driver code that didn’t touch the data but merely pointed a DMA engine at it for IO transfer – or on machines without access to sensitive data, whatever that means to the user. It’s also possible under this scheme to have trusted and untrusted apps, which speculate or not, respectively, but that would require user intervention and considerable due diligence, above all because even trusted engineers are themselves still susceptible to compile-time malware injection.
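For concreteness, here’s the data-agnostic row selection in C: both candidate rows are always read, and the key bit only chooses between them by masking in registers, so the memory trace (and, absent speculation, the cache state) is independent of the key. This is my own illustration, not lifted from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

#define ROW_WORDS 8

void select_row(uint64_t out[ROW_WORDS],
                const uint64_t row0[ROW_WORDS],
                const uint64_t row1[ROW_WORDS],
                unsigned key_bit)                 /* 0 or 1 */
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(key_bit & 1);  /* 0 or all-ones */
    for (size_t i = 0; i < ROW_WORDS; i++) {
        uint64_t a = row0[i];   /* touch both rows unconditionally */
        uint64_t b = row1[i];
        out[i] = (a & ~mask) | (b & mask);        /* blend in registers */
    }
}
```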
OK, so where’s my speculation-disable bit?! Oh, I forgot, that was only in a dream. Maybe the next best thing is a really obtuse compiler switch (ahem, GCC people?) that inserts memory fences prior to all the branch instructions. (I seem to remember that even that doesn’t necessarily work, in which case we’d need bona fide serializing instructions like CPUID and friends. Gross. But maybe acceptable if we still use hardware acceleration for most of the crypto (AES) tasks. At least that would cut down on the most damaging attacks.)
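Short of such a switch, here’s the kind of hand-placed fence I have in mind, using the LFENCE intrinsic that GCC and Clang already expose; whether fencing every branch automatically works, or is even available as a flag, I’m less sure, so treat this as a sketch rather than a recommendation.

```c
#include <emmintrin.h>     /* _mm_lfence */
#include <stddef.h>
#include <stdint.h>

static uint8_t table[16];
static size_t table_len = sizeof table;

uint8_t safe_lookup(size_t i)
{
    if (i < table_len) {
        _mm_lfence();      /* speculation barrier: the load below cannot
                              issue until the bounds check has resolved */
        return table[i];
    }
    return 0;
}
```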